Editorial 

Message from Editorial Board 


It is our great pleasure to present the December 2017 issue (Volume 15 Number 12) of the 
International Journal of Computer Science and Information Security (IJCSIS). The journal publishes 
high-quality research, survey and review articles from experts in the field, promoting insight into and 
understanding of the state of the art and of trends in computer science and technology. In particular, it 
provides a platform for high-caliber academics, practitioners and PhD/Doctoral graduates to 
publish completed work and their latest research outcomes. According to Google Scholar, 
papers published in IJCSIS have been cited over 9800 times to date, and the journal is experiencing 
steady and healthy growth. Google statistics show that IJCSIS has taken the first step toward 
becoming an international and prestigious journal in the field of Computer Science and Information 
Security. There have been many improvements to the processing of papers; we have also 
witnessed significant growth in interest, reflected in a higher number of submissions as well as 
in the breadth and quality of those submissions. IJCSIS is indexed in major 
academic/scientific databases and important repositories, such as Google Scholar, Thomson 
Reuters, ArXiv, CiteSeerX, Cornell’s University Library, Ei Compendex, ISI Scopus, DBLP, DOAJ, 
ProQuest, ResearchGate, Academia.edu and EBSCO, among others. 

A journal cannot be made great without a dedicated team of editors and reviewers. 
On behalf of the IJCSIS community and the sponsors, we congratulate the authors and thank the 
reviewers for their outstanding efforts to review and recommend high-quality papers for 
publication. In particular, we would like to thank the international academic and research community for 
its continued support in citing papers published in IJCSIS. Without this sustained and unselfish 
commitment, IJCSIS would not have achieved its current premier status or been able to deliver 
high-quality content to our readers in a timely fashion. 

“We support researchers to succeed by providing high visibility and impact value, prestige and 
excellence in research publication.” We would like to thank you, the authors and readers, the 
content providers and consumers, who have made this journal the best possible. 

For further questions or other suggestions please do not hesitate to contact us at 

ijcsiseditor@gmail.com. 

A complete list of journals can be found at: 

http://sites.google.com/site/ijcsis/ 
Abstract — This research paper explores risk management from three perspectives: risk assessment, risk 
mitigation, and continuous risk monitoring, evaluation and review. The risk assessment perspective highlights 
vulnerability/threat sources, types of security, and methods used to “break” security. The assessment processes are fully 
depicted, showing their specific steps. The paper also reflects the importance of consistent communication and consultation among the 
stakeholders, which is key and must be sustained throughout the entire process. This work also reinforces the fact that risk is not 
entirely negative, as revealed by its SWOT (Strength, Weakness, Opportunity and Threat) analysis. Hence one can build 
on the opportunities, thereby reducing the associated threats, and also improve on the strengths. This invariably reduces, 
and if possible eradicates, the weaknesses. 

Keywords — Risk, Risk Management, Risk Assessment, Vulnerability, threat, Opportunity 


I. Introduction 

In the face of ever-changing global threats and attacks, every organization is adopting measures to reduce negative risks 
and exploit positive risks. This ensures that its vision and mission are protected, guarded and fully enhanced. This is critical 
as organizations make ICT the hub for better support and sustenance of their business. As organizations automate their 
processes using data and communication devices, risk management plays a critical role in protecting the 
organization's information assets, and therefore its mission. Risk management is every stakeholder's duty, not only 
that of the technical IT team. Therefore, it should be treated as an essential function of management. An 
effective and efficient risk management process is an important component of successful ICT security, ensuring 
data confidentiality, integrity and high availability. A vulnerability is a weakness in a device that can be accidentally 
triggered or intentionally exploited, while an opportunity is a positive risk that can be invested in to 
maximize the use and benefit of the devices. On this basis, organizations are continuously working to reduce 
vulnerabilities by minimizing sources of threats and to maximize opportunities by strengthening security during 
SWOT (Strengths, Weaknesses, Opportunities and Threats) analysis. Risk management is the process of identifying risk, 
assessing risk, and taking steps to reduce risk to an acceptable level or, if possible, eradicate it completely. The objectives of 
risk assessment in this work are to increase the likelihood and impact of positive events and to decrease the likelihood of 
negative events in devices. Until now, however, risk management has not been consciously or transparently carried out for 
data and communication devices: the “practice of the day” is that the organization's perceived final step in the system 
development life cycle of the devices is simply junking them, without a proper final risk assessment to ensure that 
no critical piece of information or data can be intentionally or accidentally exploited. 
II. VULNERABILITIES/THREAT SOURCES 

Risks are continuously evolving. The goal of this step is to identify the potential vulnerability and threat-sources that are 
applicable to the ICT systems being evaluated. A threat-source is defined as any circumstance or event with the potential 
to cause harm to an ICT system. The common threat-sources can be natural, human, or environmental: 

• Natural Threats: Floods, earthquakes, tornadoes, landslides, avalanches, electrical storms, and other such events. 

• Human Threats: Events that are either enabled by or caused by human beings, such as unintentional acts 
(inadvertent data entry) or deliberate actions (network based attacks, malicious software upload, unauthorized 
access to confidential information). 

• Environmental Threats: Long-term power failure, pollution, chemicals, liquid leakage. 

In assessing threat-sources, it is important to consider all potential threat-sources that could cause harm to an ICT System 
and its processing environment. For example, although the threat statement for an ICT System located in a desert may 
not include natural flood, because of the low likelihood of such an event occurring, environmental threats such as a 
bursting pipe can quickly flood a computer room and cause damage to an organization’s IT assets and resources. 

Humans can be threat-sources through intentional acts, such as deliberate attacks by malicious persons or disgruntled 
employees, or unintentional acts, such as negligence and errors. 

A deliberate attack can be either: 

• A malicious attempt to gain unauthorized access to an ICT System (e.g., via password guessing) in order to 
compromise system and data integrity, availability, or confidentiality or 

• A benign, but nonetheless purposeful, attempt to circumvent system security. 

Motivation and Threat Actions 

According to Siciliano (2011), hackers are motivated by a number of factors such as ego, religion, politics and activism. 
Motivation and the resources for carrying out an attack make humans potentially dangerous threat-sources. In addition, 
reviews of the history of system break-ins; security violation reports; incident reports; and interviews with the system 
administrators, help desk personnel, and user community during information gathering help identify human threat- 
sources that have the potential to harm an IT system and its data and that may be a concern where vulnerability exists. 

With this information, organizations should remain mindful of these threat-sources and consciously put safeguards in place to reduce 
and/or prevent successful exploits. 

Types of Security 

Secure communication is when two or more devices communicate without eavesdropping or interception by a third 
party. This communication involves sharing data and information of varying confidentiality and integrity. Among 
the means to achieve this are: 

• Code: This is a means whereby the content and nature of communication is hidden. It is a rule for converting a piece 
of information or data (for example, a letter, word, phrase or gesture) into another form of representation, not 
necessarily of the same type. 

• Encryption: This is another means whereby the nature and content of communication is hidden. Here, data 
and communication are rendered hard to read for any unauthorized party. In some highly security-conscious 
environments, encryption is configured as a basic requirement for a connection to 
be established. This leaves no room for opportunistic encryption, a lower-security method meant to increase the 
percentage of generic traffic that is encrypted, but which can leave content susceptible to eavesdropping. 
• Steganography: This is sometimes referred to as “hidden writing”, in which data is hidden within other, 
mostly innocuous data. Hidden this way, the data is difficult to find or remove unless you know how to find it. For example, 
in communication, important data such as a telephone number can be hidden in apparently innocuous data (an MP3 
music file). A notable advantage is plausible deniability: unless one can prove that the data is there (which 
is usually not easy), it is deniable that the file contains any. 

• Identity based Networks: Unwanted or malicious behaviour is possible on the web since it is inherently 
anonymous. Identity based network removes the chance of anonymity as the identity of the sender and recipient 
are known. 

• Security by Obscurity: Similar to a needle in a haystack, this approach relies on secrecy of design or implementation to 
provide security. It is discouraged and not recommended by standards bodies, but stakeholders who use it believe 
that if the flaws are not known, attackers will be unlikely to find them. An attacker's first 
step is usually information gathering, which obscurity delays. 

• Random Traffic: This involves creating random data flow to make the presence of genuine communication 
harder to detect and traffic analysis less reliable. 

• Hard to trace routing methods: This method hides the parties involved in a communication through unauthorized 
third-party systems or relays. 
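
As an illustration of the steganography item in the list above, the following minimal Python sketch hides a short message in the least significant bit of each byte of a carrier. Least-significant-bit embedding is one common technique chosen here for illustration only; the paper does not name a specific method, and the function names `embed` and `extract`, the carrier bytes and the message are all hypothetical.

```python
def embed(carrier: bytes, message: bytes) -> bytes:
    """Hide `message` in the least significant bit of each carrier byte."""
    # Flatten the message into bits, most significant bit first.
    bits = [(byte >> shift) & 1 for byte in message for shift in range(7, -1, -1)]
    if len(bits) > len(carrier):
        raise ValueError("carrier too small for message")
    stego = bytearray(carrier)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & 0xFE) | bit  # overwrite only the lowest bit
    return bytes(stego)


def extract(stego: bytes, length: int) -> bytes:
    """Recover `length` bytes previously hidden by `embed`."""
    out = bytearray()
    for i in range(length):
        value = 0
        for bit_index in range(8):
            value = (value << 1) | (stego[i * 8 + bit_index] & 1)
        out.append(value)
    return bytes(out)


carrier = bytes(range(64))        # stand-in for innocuous media samples
stego = embed(carrier, b"0803")   # hypothetical fragment of a phone number
recovered = extract(stego, 4)
```

Each carrier byte changes by at most one unit, which is why the hidden data is hard to notice without knowing where to look; the receiver must also know the message length in advance, which this sketch passes explicitly.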

Methods used to “break” security 

• Bugging: A bug is a covert listening device, typically combining a miniature transmitter and a microphone. 
It enables unauthorized parties to listen to conversations. 

• Computers (general): Any security obtained from a computer is limited by the many ways it can be 
compromised - by hacking, keystroke logging, backdoors or even in extreme cases by monitoring the tiny 
electrical signals given off by keyboard or monitors to reconstruct what is typed or seen. 

• Laser audio Surveillance: Sounds including speech inside rooms can be sensed by bouncing a laser beam off a 
window of the room where a conversation is held and detecting and decoding the vibrations in the glass caused 
by the sound waves. 

• Spoofing: This is a situation in which one person or program successfully masquerades as another by falsifying 
data and thereby gaining an illegitimate advantage or access. For example, Caller Id, Email address, IP address 
etc. can all be spoofed. 

III. Risk assessment processes 

Risk management involves three principal processes: risk assessment, risk mitigation, and monitoring, evaluation and 
review. Risk assessment has many steps (see Figure I), with processes broadly summarized as follows: 

• Risk identification: This allows individuals to identify risks so that the stakeholders are aware of 
potential problems inherent in the devices. It is pertinent to start this stage as early as possible and to 
repeat it frequently. 

• Risk analysis and Priority: Risk analysis transforms the estimates or data about specific risks that were developed 
during risk identification into a consistent form that can be used to make prioritization decisions. 
Risk prioritization enables operations to commit resources to managing the most important risks. 

• Risk register (Statements) integration: This is the result of the risk assessment process. It is a document which 
contains lists of identified risks, root causes of risks, lists of potential responses, risk owners, symptoms and 
warning signs, a relative rating or priority list, risks flagged for additional analysis and response, and a watch list, 
which is a list of low-priority risks within the risk register. 

• Consistent Communication and Consultation: There is steady communication among stakeholders within the 
organization, as everyone is practically involved. In addition, the stakeholders can consult the 
manufacturer of the device through any of the appropriate channels, such as its representative or 
customer voices. This ensures a speedy and reliable response. 
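
The risk register and prioritization steps above can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the field names, the 1 to 5 scales, the likelihood times impact score and the `watch_threshold` value are illustrative choices and are not prescribed by this paper.

```python
from dataclasses import dataclass, field


@dataclass
class RiskEntry:
    description: str
    root_cause: str
    owner: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)
    responses: list = field(default_factory=list)

    @property
    def score(self) -> int:
        # Assumed rating rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritize(register, watch_threshold=4):
    """Split the register into a ranked priority list and a watch list."""
    ranked = sorted(register, key=lambda r: r.score, reverse=True)
    priority = [r for r in ranked if r.score > watch_threshold]
    watch = [r for r in ranked if r.score <= watch_threshold]
    return priority, watch


# Hypothetical entries, loosely based on examples mentioned in the text.
register = [
    RiskEntry("Plain-text passwords left on a disposed router",
              "No disposal checklist", "ICT manager", 3, 5,
              ["Wipe configuration before disposal"]),
    RiskEntry("Burst pipe floods the computer room",
              "Ageing plumbing", "Facilities", 1, 4),
    RiskEntry("Password guessing on the admin console",
              "Weak password policy", "Security team", 4, 4),
]
priority, watch = prioritize(register)
```

With these illustrative numbers the password-guessing risk ranks first, the disposal risk second, and the flood risk falls onto the watch list.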


Figure I: RISK ASSESSMENT PROCESSES CYCLE (labels: INPUT, OUTPUT) 



The processes are followed up with consistent updates and awareness campaigns among stakeholders as new challenges, 
discoveries and prospects arise. 
Figure II: SIMPLIFIED OVERALL RISK ASSESSMENT PROCESS FLOWCHART FOR ICT SYSTEM 

(Adopted from Stoneburner et al.) 


Principal stakeholders, knowing the potential risks and recommended controls, may ask, “When and under what 
circumstances should I take action? When shall I implement these controls to mitigate the risk and protect our 
organization?” The system architecture addresses these questions and is further articulated in the following 
rules of thumb, which provide guidance on actions to mitigate risks from intentional human threats: 

• When a vulnerability (or flaw, weakness) exists → implement an initial risk assessment to reduce the likelihood of the 
vulnerability’s being exercised. 

• When a vulnerability can be exercised → apply layered protections, architectural designs, and administrative 
controls to minimize the risk or prevent the occurrence. 

• When the attacker’s cost is less than the potential gain → apply protections to decrease the attacker’s motivation 
by increasing the attacker’s cost (e.g., system controls that limit what a system user can access and 
do can significantly reduce an attacker’s gain), then communicate and update the risk register. 

• When the loss is too great → apply design principles, layered architectural designs, and technical and nontechnical 
protections to limit the extent of the attack, thereby reducing the potential for loss, then update the risk register and 
communicate. 
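
The four rules of thumb can be read as independent checks, as the following minimal sketch suggests. The function name `recommend_actions`, its parameters and the numeric figures are hypothetical; in practice costs, gains and losses are rarely known this precisely, so the comparison is indicative only.

```python
def recommend_actions(vulnerability_exists, exploitable,
                      attacker_cost, attacker_gain,
                      expected_loss, loss_tolerance):
    """Return the mitigation actions suggested by the four rules of thumb."""
    actions = []
    if vulnerability_exists:
        actions.append("run an initial risk assessment")
    if exploitable:
        actions.append("apply layered protections and administrative controls")
    if attacker_cost < attacker_gain:
        actions.append("raise the attacker's cost and update the risk register")
    if expected_loss > loss_tolerance:
        actions.append("limit the extent of the attack and update the risk register")
    return actions


# Hypothetical figures for a weakly protected administration console.
actions = recommend_actions(
    vulnerability_exists=True, exploitable=True,
    attacker_cost=10, attacker_gain=100,
    expected_loss=50, loss_tolerance=20,
)
```

In this example all four rules fire, so all four actions are recommended; a system with no known vulnerability, a costly attack path and a tolerable loss yields an empty list.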
IV. Conclusions 

One of the main aims of risk management process analysis for ICT systems from the risk assessment perspective is to ensure 
that the devices are deployed with the best security measures in place, making the organization fully proactive 
rather than reactive, as many organizations are today. This will invariably increase stakeholders' risk appetite for 
positive risks and establish a careful risk threshold for negative risks. It further ensures that the cost of an attack 
to a potential intentional attacker is far higher than the anticipated gain, which is likely to discourage attackers. A 
successful attack has a high financial impact and leads to loss of customer confidence and a damaged business reputation. Risk assessment assists 
management in making well-informed risk management decisions to justify the huge capital expenditures that are part of an 
ICT budget, and in authorizing the ICT devices on the basis of the supporting documentation resulting from the 
performance of risk management. With the flowchart, stakeholders know what to do and hence proactively take 
the right steps and decisions to protect the organization and ensure optimal utilization of the ICT systems. 

It is worth noting that the process is a continuous one, in order to get optimal throughput from the devices with 
little or no downtime as a result of attack, vulnerability or negative risks, thereby increasing the opportunities which the 
devices can offer. Risk management therefore continues even at the final stage of the systems development life cycle, which 
is disposal of the devices. It is pertinent to carry out risk assessment at the disposal stage to ensure that sensitive data or 
information, such as vital configurations which may include plain-text passwords and administrator 
credentials, is not left behind. 
ABSTRACT 

The evolution of application fields and their sophistication have 
influenced research topics in the robotics community. This evolution 
has been dominated by human necessities. In the early 1960s, the 
industrial revolution put industrial robots in the factory to release the 
human operator from risky and harmful tasks. The later incorporation 
of industrial robots into other types of production processes added new 
requirements that called for more flexibility and intelligence in 
industrial robots. 

Keywords: Robotics, artificial intelligence, algorithm. 

Introduction 

During the last 45 years, robotics research has been aimed at finding 
solutions to the technical necessities of applied robotics. The evolution 
of application fields and their sophistication have influenced research 
topics in the robotics community. This evolution has been dominated 
by human necessities. In the early 1960s, the industrial revolution put 
industrial robots in the factory to release the human operator from risky 
and harmful tasks. The later incorporation of industrial robots into 
other types of production processes added new requirements that called 
for more flexibility and intelligence in industrial robots. Currently, the 
creation of new needs and markets outside the traditional 
manufacturing robotic market (i.e., cleaning, demining, construction, 
shipbuilding, agriculture) and the aging world we live in is demanding 
field and service robots to attend to the new market and to human 
social needs. 

This article addresses the evolution of robotics research in three 
different areas: robot manipulators, mobile robots, and biologically 
inspired robots. Although these three areas share some research topics, 
they differ significantly in most research topics and in their application 
fields. For this reason, they have been treated separately in this survey. 
The section on robot manipulators includes research on industrial 
robots, medical robots and rehabilitation robots, and briefly surveys 
other service applications such as refueling, picking and palletizing. 
When surveying the research in mobile robots we consider terrestrial 
and underwater vehicles. Aerial vehicles are less widespread and for 
this reason have not been considered. Biologically inspired robots 
include mainly walking robots and humanoid robots; however, some 
other biologically inspired underwater systems are briefly mentioned. 
In spite of the differences between robot manipulators, mobile robots 
and biologically inspired robots, the three research areas converge in 
their current and future intended use: field and service robotics. With 
the modernization of the First World, new services are being 
demanded that are shifting how we think of robots from the 
industrial viewpoint to the social and personal viewpoint. Society 
demands new robots designed to assist and serve the human being, 
and this harks back to the first origins of the concept of the robot, 
as transmitted by science fiction since the early 1920s: the robot 
as a human servant (see Figure 1). Also, the creation of new needs 
and markets outside the traditional market of manufacturing 
robotics leads to a new concept of robot. A new sector is therefore 
arising from robotics, a sector with a great future giving service to 
the human being. Traditional industrial robots and mobile robots 
are being modified to address this new market. Research has 
evolved to find solutions to the technical necessities of each stage 
in the development of service robots. 


Figure 1. ASIMO. Photograph courtesy of American Honda 
Motor Co. 


Robot Manipulators 

A robot manipulator, also known as a robot arm, is a serial chain 
of rigid limbs designed to perform a task with its end-effector. 
Early designs concentrated on industrial manipulators, to perform 
tasks such as welding, painting, and palletizing. The evolution of 
the technical necessities of society and the technological advances 
achieved have helped the strong growth of new applications in 
recent years, such as surgery assistance, rehabilitation, automatic 
refuelling, etc. This section surveys those areas that have received 
a special, concentrated research effort, namely, industrial robots, 
medical robots, and rehabilitation robots. 
Industrial Robots 

It was around 1960 when industrial robots were first introduced in the 
production process, and until the 1990s industrial robots dominated 
robotics research. In the beginning, the automotive industry dictated the 
specifications industrial robots had to meet, mainly due to the 
industry’s market clout and clear technical necessities. These 
necessities determined which areas of investigation were predominant 
during that period. 

One such area was kinematic calibration, which is a necessary 
process due to the inaccuracy of kinematic models based on 
manufacturing parameters. The calibration process is carried out in four 
stages. The first stage is mathematical modeling, where the Denavit- 
Hartenberg (DH) method and the product-of-exponential (POE) 
formulation lead the large family of methods. A detailed discussion of 
the fundamentals of kinematic modeling can be found in the literature 
[1]. The gap between the theoretical model and the real model is found 
in the second stage by direct measurement through sensors. Thus, the 
true position of the robot’s end effector is determined, and by means of 
optimization techniques, the parameters that vary from their nominal 
values are identified in the third stage. Last, implementation in the 
robot is the process of incorporating the improved kinematic model. 
This process will depend on the complexity of the machine, and 
iterative methods will have to be employed in the most complex cases. 
Research in robot calibration remains an open issue, and new methods 
that reduce the computational complexity of the calibration process are 
still being proposed [2], [3]. 
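
To make the modeling stage concrete, the following minimal sketch builds the standard Denavit-Hartenberg link transform and chains it for a planar two-link arm. The arm geometry, the joint angles and the 2 mm link-length error are illustrative assumptions, not values from this survey; the point is only to show how a small parameter error appears as an end-effector position error, which is the gap that calibration measures and corrects.

```python
import math


def dh_matrix(theta, d, a, alpha):
    """Homogeneous transform for one link in standard DH convention."""
    ct, st = math.cos(theta), math.sin(theta)
    ca, sa = math.cos(alpha), math.sin(alpha)
    return [
        [ct, -st * ca, st * sa, a * ct],
        [st, ct * ca, -ct * sa, a * st],
        [0.0, sa, ca, d],
        [0.0, 0.0, 0.0, 1.0],
    ]


def matmul(A, B):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def forward_kinematics(dh_rows):
    """Chain the link transforms and return the end-effector position."""
    T = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for row in dh_rows:
        T = matmul(T, dh_matrix(*row))
    return T[0][3], T[1][3], T[2][3]


def end_effector(joint_angles, link_lengths):
    # Planar arm: link offset d = 0 and twist alpha = 0 for every joint.
    rows = [(theta, 0.0, a, 0.0) for theta, a in zip(joint_angles, link_lengths)]
    return forward_kinematics(rows)


angles = [math.pi / 2, -math.pi / 2]
nominal = end_effector(angles, [1.0, 0.5])    # model from design parameters
actual = end_effector(angles, [1.002, 0.5])   # assumed 2 mm error in link 1
position_error = math.hypot(actual[0] - nominal[0], actual[1] - nominal[1])
```

In the identification stage, an optimizer would adjust the link parameters until the modeled positions match the measured ones across many poses; that step is omitted here.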

Another important research topic is motion planning, wherein 
subgoals are calculated to control the completion of the robot’s task. In 
the literature there are two types of algorithms, implicit methods and 
explicit methods. Implicit methods specify the desired dynamic 
behavior of the robot. One implicit scheme that is attractive from the 
computational point of view is the potential field algorithm [4]. One 
disadvantage of this approach is that local minima of the potential field 
function can trap the robot far from its goal. Explicit methods provide 
the trajectory of the robot between the initial and final goal. Discrete 
explicit methods focus on finding discrete collision-free configurations 
between the start and goal configurations. These methods consist 
mainly of two classes of algorithms, the family of road-map methods 
that include the visibility graph, the Voronoi diagram, the free-way 
method and the Roadmap algorithm [5], and the cell-decomposition 
methods [6]. Continuous explicit methods, on the other hand, consist 
basically of open-loop control laws. One important family of methods is 
based on optimal-control strategies [7], whose main disadvantages are 
their computational cost and dependence on the accuracy of the robot’s 
dynamic model. 
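
The local-minimum problem of the potential field approach can be reproduced in a few lines. The sketch below is a simplified point-robot version in the plane, assuming a quadratic attractive potential and the commonly used repulsive potential that acts only within a distance `rho0` of an obstacle; the gains, step size and obstacle placement are illustrative assumptions rather than values from [4].

```python
import math


def gradient(q, goal, obstacles, k_att=1.0, k_rep=1.0, rho0=2.0):
    """Gradient of the attractive plus repulsive potential at point q."""
    gx = k_att * (q[0] - goal[0])
    gy = k_att * (q[1] - goal[1])
    for ox, oy in obstacles:
        dx, dy = q[0] - ox, q[1] - oy
        rho = math.hypot(dx, dy)
        if 0.0 < rho < rho0:
            # Gradient of 0.5 * k_rep * (1/rho - 1/rho0)**2 with respect to q.
            scale = -k_rep * (1.0 / rho - 1.0 / rho0) / rho ** 3
            gx += scale * dx
            gy += scale * dy
    return gx, gy


def descend(start, goal, obstacles, step=0.01, iterations=5000):
    """Follow the negative gradient with a fixed step size."""
    q = list(start)
    for _ in range(iterations):
        gx, gy = gradient(q, goal, obstacles)
        q[0] -= step * gx
        q[1] -= step * gy
    return tuple(q)


goal = (10.0, 0.0)
free = descend((0.0, 0.0), goal, obstacles=[])
# An obstacle exactly on the straight line to the goal: attraction and
# repulsion cancel in front of it, and the robot stops short of the goal.
trapped = descend((0.0, 0.0), goal, obstacles=[(5.0, 0.0)])
free_distance = math.hypot(free[0] - goal[0], free[1] - goal[1])
trapped_distance = math.hypot(trapped[0] - goal[0], trapped[1] - goal[1])
```

Without obstacles the robot converges to the goal, while in the second run it settles in front of the obstacle, more than five units from the goal. Practical planners add escape strategies such as random perturbations, which this sketch deliberately leaves out.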

Besides planning robot motion, control laws that assure the 
execution of the plan are required in order to accomplish the robot’s 
task. Thus, one fundamental research topic focuses on control 
techniques. A robot manipulator is a nonlinear, multi-variable system 
and a wide spectrum of control techniques can be applied here, 
ranging from the simpler proportional derivative (PD) and proportional 
integral derivative (PID) control to the computed-torque method [8], 
and the more sophisticated adaptive control [9], whose details are beyond 
the scope of this survey. 
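
As a point of reference for the simplest of these schemes, the sketch below simulates PD position control of a single joint. It assumes a deliberately simplified plant (a pure inertia with no gravity, friction or coupling, so not the nonlinear multi-variable system described above), and the gains, time step and duration are illustrative values chosen to give a critically damped response.

```python
def simulate_pd(target, kp=25.0, kd=10.0, inertia=1.0, dt=0.001, duration=5.0):
    """Simulate one joint under PD control and return the final angle."""
    theta, omega = 0.0, 0.0  # joint angle (rad) and angular velocity (rad/s)
    for _ in range(round(duration / dt)):
        # PD law: torque proportional to position error, damped by velocity.
        torque = kp * (target - theta) - kd * omega
        omega += (torque / inertia) * dt  # semi-implicit Euler integration
        theta += omega * dt
    return theta


final_angle = simulate_pd(target=1.0)
```

With these gains the joint settles on the 1 rad target without overshoot. On a real manipulator, gravity and inter-joint coupling leave a residual error under pure PD control, which is one motivation for the computed-torque and adaptive schemes cited above.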

Typical industrial robots are designed to manipulate objects and 
interact with their environment, mainly during tasks such as polishing, 
milling, assembling, etc. In the control of the interaction between 
manipulator and environment, the contact force at the manipulator’s 
end effector is regulated. There are diverse schemes of active force 
control, such as stiffness control, compliant control, impedance control, 
explicit force control and hybrid force/position control. 


The first three schemes belong to the category of indirect force 
control, which achieves force control via motion control, while the 
last two methods perform direct force control by means of explicit 
closure of the force-feedback loop. Readers who wish to study this 
subject in detail will find an interesting account in [10]. 

An attractive alternative for implementing force-control laws is the 
use of passive mechanical devices so that the trajectory of the 
robot is modified by interaction forces due to the robot’s own 
accommodation. An important example of passive force control is 
the remote center of compliance (RCC) system patented by 
Watson in 1978 [11] for peg-in-hole assembly. Passive force 
control is simpler than active force control laws but has 
disadvantages, such as lacking flexibility and being unable to 
avoid the appearance of high contact forces. 



Figure 2. Robots in the food industry. 

As the 1990s began, new application areas for industrial robots 
arose that imposed new specifications, with flexibility as the 
principal characteristic. The new industries that introduced 
industrial robots into their production processes were the food and 
pharmacy industries (see Figure 2). Postal services too looked 
for robotic systems to automate their logistics. The main 
requirement was the capacity to accommodate variations in 
product, size, shape, rigidity (in the case of foods), etc. The 
ability to self-adapt to the product and the environment became 
the issue in the following lines of investigation in the area of 
industrial robotics. The main line of research now is aimed at 
equipping the control system with sufficient intelligence and 
problem-solving capability. This is obtained by resorting to 
artificial-intelligence techniques. Different artificial 
intelligence (AI) techniques are used to provide the robot with 
intelligence and flexibility so it can operate in dynamic 
environments and in the presence of uncertainty. Those 
techniques belong to three areas of artificial intelligence: 
learning, reasoning and problem solving [12]. Among the 
diverse learning algorithms, inductive learning is the most 
widely used in robotics, in which the robot learns from 
preselected examples. 
Typical reasoning paradigms in robotics include fuzzy reasoning [14], 
mostly used in planning under uncertainty, spatial reasoning, and 
temporal reasoning. The techniques most commonly used in robotics 
for problem solving are means-end reasoning, heuristic searching, and 
the blackboard (BB) model. 

Another solution to the control of robots in dynamic or unknown 
environments consists of introducing the operator in the control 
loop, such that the robot is remotely operated. The success of a 
teleoperation system relies on the correct feedback of the robot 
interaction with the environment, which can be visual, tactile or 
force reflection. The greatest disadvantage of teleoperated 
systems is the transmission delay when the distance between 
the operator and the robot is significant, as in space teleoperation 
or over the Internet. Some research has explored solutions to this 
problem, such as interposing a virtual robot in charge of 
environment feedback, but this procedure is only valid if the 
robot works in structured environments. Another solution is 
teleprogramming, in which the operator sends high-level 
commands and the robot carries out the task in closed-loop 
control. Recently, considerable attention has been devoted to 
Internet-based teleoperation, in which the transmission delay is 
variable. For direct force feedback, wave-variable-based 
approaches have been used extensively, and they have been 
further extended to include estimation and prediction of the 
delay. A comprehensive survey can be found in [15]. 

With the rapid modernization of the First World, new types 
of services are being required to maintain a certain quality of 
life. A new, promising robotics sector is arising to serve the 
human being. Traditional industrial robots are being 
modified to respond to this new market, yielding surgery 
robots, refueling robots, picking and palletising robots, feeding 
robots, rehabilitation robots, etc. Two of the most relevant 
service applications of robot manipulators are in the field of 
medical robots and rehabilitation robots that are catching the 
interest of researchers all over the world. In the following 
subsections, we will summarize research topics in medical 
robotics and rehabilitation robotics. 

Medical Robots 

In recent years, the field of medicine has also been invaded by 
robots, not to replace qualified personnel such as doctors and 
nurses, but to assist them in routine work and precision tasks. 
Medical robotics is a promising field that really took off in the 
1990s. Since then, a wide variety of medical applications have 
emerged: laboratory robots, telesurgery, surgical training, remote 
surgery, telemedicine and teleconsultation, rehabilitation, help for 
the deaf and the blind, and hospital robots. Medical robots assist 
in operations on heart-attack victims and make possible the 
millimeter-fine adjustment of prostheses. There are, however, 
many challenges in the widespread implementation of robotics in 
the medical field, mainly due to issues such as safety, precision, 
cost and reluctance to accept this technology. 

Medical robots may be classified in many ways: by 
manipulator design (e.g., kinematics, actuation); by level of 
autonomy (e.g., preprogrammed versus teleoperation versus 
constrained cooperative control); by targeted anatomy or 
technique (e.g., cardiac, intravascular, percutaneous, laparoscopic, 
micro-surgical); by intended operating environment [e.g., in-scanner, 
conventional operating room (OR)], etc. Research 
remains open in the field of surgical robotics, where extensive 
effort has been invested and results are impressive. Some of the 
key technical barriers include safety [16], where some of the basic 
principles at issue are redundancy, avoiding unnecessary speed or 
power in actuators, rigorous design analysis and multiple 
emergency stop and checkpoint/restart facilities. Medical human- 
machine interfaces are another key issue that draws upon 
essentially the same technologies as other application domains. 
Surgeons rely on vision as their dominant source of feedback; 
however, due to the limited resolution of current-generation video 
cameras, there is interest in optical overlay methods, in which 
graphic information is superimposed on the surgeon’s field of 
view to improve the information provided [17]. As surgeons 
frequently have their hands busy, there has also been interest in 
using voice as an interface. Force and haptic feedback is another 
powerful interface for telesurgery applications [18]. Much of the 
past and present work on telesurgery involves the use of master- 
slave manipulator systems [19], [20]. These systems have the 
ability to feed forces back to the surgeon through the master 
manipulator, although slaves’ limitations in sensing tool-to-tissue 
forces can somewhat reduce this ability. 

The field of medical robotics is expanding rapidly and results 
are impressive as a large number of commercial devices are 
being used in hospitals. However, societal barriers have to be 
overcome and significant engineering research effort is 
required before medical robots have widespread impact on 
health care. 

Rehabilitation Robots 

Activity in the field of rehabilitation robotics began in the 1960s 
[21] and has slowly evolved through the years to a point where 
the first commercially successful products are now available. 
Today, the concept of “rehabilitation robot” may include a wide 
array of mechatronic devices ranging from artificial limbs to 
robots for supporting rehabilitation therapy or for providing 
personal assistance in hospital and residential sites. Examples 
include robots for neuro-rehabilitation [22], power-augmentation 
orthosis [23], rehabilitative orthosis, etc. The field of 
rehabilitation robotics is less developed than that of industrial 
robotics. Many assistive robotic systems have featured an 
industrial robot arm for reasons of economy and availability [24]. 
However, the specifications for robots in these two application 
areas are very different. The differences arise from the 
involvement of the user in rehabilitation applications. Industrial 
robots are typically powerful and rigid to provide speed and 
accuracy. They operate autonomously and, for reasons of safety, 
no human interaction is permitted. 
Rehabilitation robots must operate more slowly and be more 
compliant to facilitate safe user interaction. Thus, rehabilitation 
robotics is more akin to service robotics, which integrates humans 
and robots in the same task. Safety is essential, and special attention 
must be paid to human-machine interfaces, which have to be adapted 
for disabled or unskilled people operating a specific 
programming device. It is also recognized that there is a need for 
research and development in robotics to focus on developing 
more flexible systems for use in unstructured environments. The 
leading developments of this type in rehabilitation robotics 
concern, among other topics, mechanical design (including 
mobility and end-effectors), programming, control and man-machine 
interfaces [25]. Subsection “Humanoid Robots” of this 
article expands on new research into human-robot interaction. 

Mobile Robots 

The term mobile robot describes a robotic system able to carry 
out tasks in different places and consisting of a platform moved 
by locomotive elements. The choice of the locomotive system 
depends firstly on the environment in which the robot will 
operate. This can be aerial, aquatic or terrestrial. In the aquatic 
and aerial environments, the locomotive systems are usually 
propellers or screws, although at the seabed legs are also used. 
The choice of the locomotive system on land is more 
complicated due to the variety of terrestrial environments. 
Wheels, tracks, and legs are typical terrestrial locomotive 
elements. 

Mobility provides robots with enhanced operating capacity 
and opens up new areas of investigation. Some such areas are 
common to all mobile robots, like the navigation problem, 
whereas others deal more specifically with a certain locomotion 
system, like the walking gait. 

Almost as soon as industrial robots were introduced into the 
production process, mobile robots were installed in the factory. 
This was around 1968, and the robots were mainly automated 
guided vehicles (AGVs), vehicles transporting tools and 
following a predefined trajectory. Today, however, research in 
this area deals with autonomous indoor and outdoor 
navigation. Autonomous mobile-robot navigation consists of four 
stages: perception of the environment, self-localization, motion 
planning and motion generation. 

In structured environments, the perception process allows maps or 
models of the world to be generated that are used for robot 
localization and motion planning. In unstructured or dynamic 
environments, however, the robot has to learn how to navigate. 
Navigation is, therefore, one of the main applications of artificial 
intelligence to robotics, where learning, reasoning and problem 
solving come together. The main research in mobile robotics is 
focusing on robot localization and map generation. 

Conclusion 

Since the introduction of industrial robots in the automotive industry, 
robotics research has evolved over time towards the development of 
robotic systems to help the human in dangerous, risky or unpleasant 
tasks. As the complexity of tasks has increased, flexibility has been 
demanded in industrial robots, and robotics research has veered 
towards adaptive and intelligent systems. 

Since 1995, robotics research has entered the field- and service- 
robotics world, where we can find manipulators, mobile robots and 
animal-like robots with great prospects for development and 
increasing research interest. Surgical robots have been the first 
successes, and recently different areas in medical- and rehabilitation- 
robotics applications have arisen. Other examples can be found in the 
fields of home cleaning, refueling and museum exhibitions, to name 
just a few areas. 

Service-robotics research is also aimed at providing a 
comfortable, easy life for the human being in an aging world. The 
United Nations Economic Commission for Europe (UNECE) 
forecasts strong growth of professional robots in application areas 
such as humanoid robots, field robots, underwater systems and 
mobile robot platforms for multiple use in the period of 2005-2008 
[86]. The UNECE also forecasts a tremendous rise in personal robots 
in the next few years. Robotics research has to make a great effort to 
solve in very few years the challenges of this new field of research, 
which will be largely determined by interaction between humans and 
robots. Figure 10 summarizes the evolution of robotics research over 
the last 50 years. 

It is a fact that, during the last decade, the activity in conferences 
and expositions all over the world has reflected low activity in 
industrial manipulators and huge activity in other areas related to 
manipulation in unstructured environments and mobility, including 
wheeled, flying, underwater, legged and humanoid robots. Maybe the 
key is that new challenges in manipulation in factories require less 
research now because factory needs lie in the field of traditional 
engineering. 

With these premises we can conclude: Yes, definitely robotics 
research is moving from industrial to field and service applications, 
and most robotics researchers are enthusiastic about this broad, 
exciting field. One development that is very representative of the way 
the field is evolving is the controversy set off by Prof. Engelberger, 
the creator of the first robotics company, at the 2005 International 
Robot Exhibition in Tokyo, Japan, when he commented on the 
needless research by both Japanese companies and scientific 
institutions for developing toy-like animal and humanoid robots for 
very doubtful use. Engelberger thus gained many detractors, who 
quickly argued back that these kinds of robots are a necessary 
step in the evolution towards real robots capable of helping disabled 
persons, performing dangerous work and moving in hazardous 
places. 

Other defenders of the development of human-like personal 
robots advocate the importance of aiming at such challenging tasks 
because of the technology that can be developed, which would prove 
very important from the commercial point of view in other industrial 
activities. 

Maybe behind all the arguments there still lies the human dream 
of the universal robot: a single device that can perform any task. 
Nothing better for that than a device resembling (what else?) a 
human being. So, let our imagination fly into the world of service 
robotics, but, please, do not forget to keep an eye on traditional 
industrial manipulators. 

ABSTRACT 

Knowledge Discovery in Databases is the process of searching 
for hidden knowledge in the massive amounts of data that we 
are technically capable of generating and storing. Data, in its 
raw form, is simply a collection of elements, from which little 
knowledge can be gleaned. With the development of data 
discovery techniques the value of the data is significantly 
improved. A variety of methods are available to assist in 
extracting patterns that when interpreted provide valuable, 
possibly previously unknown, insight into the stored data. This 
information can be predictive or descriptive in nature. Data 
mining, the pattern extraction phase of KDD, can take on 
many forms, the choice dependent on the desired results. KDD 
is a multi-step process that facilitates the conversion of data to 
useful information. Our increased ability to gain information 
from stored data raises the ethical dilemma of how the 
information should be treated and safeguarded. 

Keywords 

Knowledge Discovery in Databases, Data Mining, Knowledge 
Mining 


1. INTRODUCTION 

The desire and need for information has led to the 
development of systems and equipment that can generate and 
collect massive amounts of data. Many fields, especially those 
involved in decision making, are participants in the 
information acquisition game. Examples include: finance, 
banking, retail sales, manufacturing, monitoring and diagnosis, 
health care, marketing and science data acquisition. Advances 
in storage capacity and digital data gathering equipment, such 
as scanners, have made it possible to generate massive datasets, 
sometimes called data warehouses, that measure in terabytes. 
For example, NASA's Earth Observing System is expected to 
return data at rates of several gigabytes per hour by the end of 
the century (Way, 1991). Modern scanning equipment records 
millions of transactions from common daily activities such as 
supermarket or department store checkout-register sales. The 
explosion in the number of resources available on the World 
Wide Web is another challenge for indexing and searching 
through a continually changing and growing "database." 

Our ability to wade through the data and turn it into 
meaningful information is hampered by the size and 
complexity of the stored information base. In fact, the sheer 
size of the data makes human analysis untenable in many 
instances, negating the effort spent in collecting the data. 
There are several viable options currently being used to assist 
in weeding out usable information. The information retrieval 
process using these various tools is referred to as Knowledge 
Discovery in Databases (KDD). 

"The basic task of KDD is to extract knowledge (or 
information) from lower level data (databases)" (Fayyad et al., 
1995). There are several formal definitions of KDD; all agree 
that the intent is to harvest information by recognizing patterns 
in raw data. Let us examine the definition proposed by Fayyad, 
Piatetsky-Shapiro and Smyth: "Knowledge Discovery in 
Databases is the non-trivial process of identifying valid, novel, 
potentially useful, and ultimately understandable patterns in 
data" (Fayyad et al., 1995). The goal is to distinguish from 
unprocessed data something that may not be obvious but is 
valuable or enlightening in its discovery. Extraction of 
knowledge from raw data is accomplished by applying data 
mining methods. KDD has a much broader scope: data mining 
is only one step in a multi-step process. 

Knowledge Discovery in Databases Process 

Steps in the KDD process are depicted in the following 
diagram. It is important to note that KDD is not accomplished 
without human interaction. The selection of a data set and 
subset requires an understanding of the domain from which 
the data is to be extracted. For example, a database may 
contain customer addresses that would not be pertinent to 
discovering patterns in the selection of food items at a grocery 
store. Deleting non-related data elements from the dataset 
reduces the search space during the data mining phase of 
KDD. If the dataset can be analyzed using a sampling of the 
data, the sample size and composition are determined during 
this stage. 



Fig. 1: Steps in the KDD Process. 
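
The selection stage described above can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the field names (`address`, `item`, `quantity`), the helper names `select_fields` and `draw_sample`, and the fixed random seed are hypothetical choices made for this example and do not come from any particular KDD tool.

```python
import random


def select_fields(records, keep):
    """Keep only the attributes a domain expert judged relevant."""
    return [{name: rec[name] for name in keep} for rec in records]


def draw_sample(records, size, seed=0):
    """Draw a reproducible random sample without replacement."""
    rng = random.Random(seed)  # fixed seed (assumption) so the sample can be repeated
    return rng.sample(records, min(size, len(records)))


# Hypothetical grocery transactions: the address is not pertinent to
# discovering patterns in the selection of food items.
transactions = [
    {"address": "12 Elm St", "item": "bread", "quantity": 2},
    {"address": "9 Oak Ave", "item": "milk", "quantity": 1},
    {"address": "3 Pine Rd", "item": "bread", "quantity": 1},
    {"address": "7 Ash Ln", "item": "eggs", "quantity": 12},
]

relevant = select_fields(transactions, keep=("item", "quantity"))
sample = draw_sample(relevant, size=2)
```

Dropping the address column shrinks the search space before mining; whether a sample of two records (or any other size) is adequate is a domain decision that the code cannot make.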

Databases are notoriously "noisy": they contain inaccurate or 
missing data. During the preprocessing stage the data is 
cleaned. This involves the removal of "outliers" if appropriate; 
deciding strategies for handling missing data fields; 
accounting for time sequence information; and applicable 
normalization of data (Fayyad, 1996). 
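
As a rough illustration of the cleaning steps just listed, the following Python sketch handles missing values, outliers and normalization for a single numeric attribute. The specific strategies are assumptions chosen for the example, not methods prescribed by the sources cited here: missing entries are filled with the median, values further than `k` median absolute deviations from the median are dropped, and the rest are rescaled to the range [0, 1]. The function name `preprocess` and the sample readings are hypothetical.

```python
from statistics import median


def preprocess(values, k=5.0):
    """Impute missing values, drop outliers, then min-max normalize."""
    present = [v for v in values if v is not None]
    center = median(present)

    # 1. Missing data: replace None with the median of the observed values.
    filled = [center if v is None else v for v in values]

    # 2. Outliers: drop values more than k median absolute deviations away.
    mad = median([abs(v - center) for v in filled])
    if mad > 0:
        kept = [v for v in filled if abs(v - center) <= k * mad]
    else:
        kept = filled  # no spread to judge outliers against

    # 3. Normalization: rescale the remaining values to [0, 1].
    low, high = min(kept), max(kept)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in kept]


# Hypothetical readings with one missing entry and one obvious outlier.
cleaned = preprocess([10.0, 12.0, None, 11.0, 13.0, 500.0])
```

Here the reading of 500.0 is removed and the five remaining values are rescaled. Whether an extreme value is an error or a genuine observation remains a judgment for someone who knows the domain.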

The transformation phase attempts to limit or reduce the 
number of data elements that are evaluated while maintaining 
the validity of the data. During this stage data is organized, 
converted from one type to another (e.g., changing nominal to 
numeric) and new or "derived" attributes are defined. 
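
The two transformations mentioned above, converting a nominal attribute to a numeric one and defining a derived attribute, might look as follows in Python. This is a minimal sketch: the record layout, the field names `payment`, `price`, `quantity` and `total`, and the helper names are all hypothetical.

```python
def encode_nominal(records, field):
    """Replace a nominal field with integer codes; return records and the code table."""
    codes = {}
    encoded = []
    for rec in records:
        # A value seen for the first time receives the next unused integer.
        code = codes.setdefault(rec[field], len(codes))
        encoded.append({**rec, field: code})
    return encoded, codes


def add_derived(records, name, compute):
    """Attach a new attribute computed from the existing ones."""
    return [{**rec, name: compute(rec)} for rec in records]


# Hypothetical sales records.
sales = [
    {"payment": "cash", "price": 2.0, "quantity": 3},
    {"payment": "card", "price": 5.0, "quantity": 1},
    {"payment": "cash", "price": 1.5, "quantity": 4},
]

rows, payment_codes = encode_nominal(sales, "payment")
rows = add_derived(rows, "total", lambda rec: rec["price"] * rec["quantity"])
```

Integer codes impose an artificial ordering on the categories, so a method that treats the attribute as a magnitude may be misled; one indicator column per category is a common alternative.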

At this point the data is subjected to one or several data mining 
methods such as classification, regression, or clustering. The 
data mining component of KDD often involves repeated 
iterative application of particular data mining methods. "For 
example, to develop an accurate, symbolic classification 
model that predicts whether magazine subscribers will renew 
their subscriptions, a circulation manager might need to first 
use clustering to segment the subscriber database, and then 
apply rule induction to automatically create a classification for 
each desired cluster" (Simoudis, 1996). Various data mining 
methods are discussed in more detail in the following sections. 

The final step is the interpretation and documentation of the 
results from the previous steps. Actions at this stage could 
consist of returning to a previous step in the KDD process to 
further refine the acquired knowledge, or translating the 
knowledge into a form understandable to the user. A 
commonly used interpretive technique is visualization of the 
extracted patterns. The results should be critically reviewed 
and conflicts with previously believed or extracted knowledge 
resolved. 

Understanding and committing to all phases of the data mining 
process is crucial to its success. 


Data Mining Models 

A few of the many model functions being incorporated in 
KDD include: 

Classification: mapping or classifying data into one of several 
predefined classes (Hand, 1981). For example, a bank may 
establish two classes based on debt-to-income ratio. The 
classification algorithm determines within which of the two 
classes an applicant falls and generates a loan decision based 
on the result. 
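
The bank example can be sketched as a one-attribute classifier in Python. This is an illustrative assumption rather than the method of any cited system: the threshold on the debt-to-income ratio is chosen to minimize errors on a small, invented loan history, and the names `fit_threshold` and `classify` as well as the class labels "approve" and "decline" are hypothetical.

```python
def fit_threshold(ratios, repaid):
    """Pick the ratio cut-off that misclassifies the fewest past loans."""
    best_threshold, best_errors = None, len(ratios) + 1
    for candidate in sorted(set(ratios)):
        # Predict "repaid" when the ratio is at or below the candidate.
        errors = sum((r <= candidate) != ok for r, ok in zip(ratios, repaid))
        if errors < best_errors:
            best_threshold, best_errors = candidate, errors
    return best_threshold


def classify(monthly_debt, monthly_income, threshold):
    """Assign an applicant to one of the two predefined classes."""
    if monthly_income <= 0:
        raise ValueError("monthly_income must be positive")
    ratio = monthly_debt / monthly_income
    return "approve" if ratio <= threshold else "decline"


# Invented history: debt-to-income ratio and whether the loan was repaid.
history_ratios = [0.10, 0.20, 0.30, 0.45, 0.60]
history_repaid = [True, True, True, False, False]

threshold = fit_threshold(history_ratios, history_repaid)
decision_low = classify(1000, 4000, threshold)   # ratio 0.25
decision_high = classify(2000, 4000, threshold)  # ratio 0.50
```

The classes are fixed in advance and only the boundary between them is learned, which is what separates classification from the clustering described later.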



Figure 2: Regression Analysis 


Regression: "a learning function which maps a data item to a 
real-valued prediction variable" (Hand, 1981). Comparing a 
particular instance of an electric bill to a predetermined norm 
for that same time period and observing deviations from that 
norm is an example of regression analysis. 
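
To make the electric bill example concrete, the sketch below fits a straight line by ordinary least squares and measures how far a new bill deviates from the fitted norm. The choice of a linear model, the predictor (a hypothetical measure of heating demand) and all numbers are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def deviation(x, observed, intercept, slope):
    """Difference between an observed value and the fitted norm."""
    return observed - (intercept + slope * x)


# Invented history: heating demand (arbitrary units) and bill amount.
demand = [1.0, 2.0, 3.0, 4.0]
bills = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_line(demand, bills)
excess = deviation(5.0, 14.0, intercept, slope)  # a new bill of 14.0 at demand 5.0
```

A large positive deviation flags a bill that is higher than the norm predicts; how large is "large" would normally be judged against the spread of past deviations.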

Clustering: "maps a data item into one of several categorical 
classes (or clusters) in which the classes must be determined 
from the data, unlike classification in which the classes are 
predefined. Clusters are defined by finding natural groupings 
of data items based on similarity metrics or probability density 
models" (Fayyad et al., 1996). An example of this technique 
would be grouping patients based on symptoms exhibited. The 
clusters need not be mutually exclusive. 
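
One widely used clustering technique based on a similarity metric is k-means, sketched below in plain Python. It is offered only as an example of the idea: the text does not name a specific algorithm, the two symptom scores per patient are invented, and the initial centroids are simply the first `k` points so that the run is deterministic.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, max_iter=20):
    """Plain k-means; initial centroids are the first k points (assumption)."""
    centroids = list(points[:k])
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # converged
        centroids = updated
    return centroids, assign(points, centroids)


# Invented patients described by two symptom scores.
patients = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centroids, labels = kmeans(patients, k=2)
```

Unlike classification, no class labels are supplied; the groups emerge from the data. Note that k-means assigns each item to exactly one cluster, so it does not cover the overlapping clusters that the text allows; soft or probabilistic methods would be needed for that.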

Summarization: generating a concise description of the data. 
Routine examples of these techniques include the mean and 
standard deviation of specific data elements within the dataset. 
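
The summary statistics mentioned above are available in the Python standard library. The short sketch below uses `statistics.mean` and `statistics.pstdev` (the population standard deviation); the helper name `summarize` and the data are hypothetical.

```python
from statistics import mean, pstdev


def summarize(values):
    """Concise description of one numeric attribute."""
    return {
        "count": len(values),
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation
    }


# Invented values for a single data element.
summary = summarize([2, 4, 4, 4, 5, 5, 7, 9])
```

If the values are a sample rather than the whole population, `statistics.stdev` would be the more appropriate choice.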

Dependency modeling: developing a model that shows how 
variables are interrelated. An example would be a model 
showing that electrical usage is highly correlated with the 
ambient temperature. 
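
A simple way to quantify the kind of dependency described above is the Pearson correlation coefficient. The sketch below computes it directly from its definition; the temperature and usage figures are invented, and correlation is only one of several possible dependency measures.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Invented daily readings: ambient temperature and electrical usage.
temperature = [18, 22, 27, 31, 35]
usage = [21, 25, 33, 38, 45]

r = pearson(temperature, usage)
```

A coefficient close to 1 indicates a strong linear relationship between the two variables. It does not by itself establish that one variable causes the other.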


https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 

International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 15, No. 12, December 2017 


Choosing a Data Mining Model 

There are no established guidelines to assist in choosing the 
correct algorithm to apply to a dataset. Typically, the more 
complex models may fit the data better but may also be more 
difficult to understand and to fit reliably (Fayyad et al., 1995). 
Successful applications often use simpler models due to 
their ease of translation. Each technique tends to lend itself to 
a particular type of problem. Understanding the domain will 
assist in determining what kind of information is needed from 
the discovery process thereby narrowing the field of choice. 
Results can be broken into two general categories: prediction 
and description. Prediction, as the name implies, attempts to 
forecast the possible future values of data elements. Prediction 
is being applied extensively in the area of finance in an 
attempt to forecast movement in the stock market. Description 
seeks to discover interpretable patterns in the data. Fraud 
detection is an application that uses description to identify 
characteristics of potentially fraudulent transactions. 

Classification, clustering, summarization and dependency 
modeling are descriptive models, while regression is 
predictive. 

Current Applications of KDD 

Several Knowledge Discovery applications have been 
successfully implemented. One is SKICAT, a system which 
automatically detects and classifies sky objects in image data 
resulting from a major astronomical sky survey. SKICAT can 
outperform astronomers in accurately classifying faint sky 
objects (Fayyad et al., 1995). KDD is being used to flag 
suspicious activities on two fronts: Falcon alerts banks to 
possibly fraudulent credit card transactions, and the FAIS 
system employed by the Financial Crimes Enforcement 
Network detects financial transactions that may indicate 
money laundering (Simoudis, 1996). Market Basket Analysis 
(MBA) has incorporated discovery-driven data mining 
techniques to gain insights about customer behavior. Other 
applications are being used in molecular biology, global 
climate change modeling and other fields where the 
volume of data exceeds our ability to decipher its meaning. 

Privacy Concerns and Knowledge Discovery 

Although not unique to Knowledge Discovery, sensitive 
information is being collected and stored in these huge data 
warehouses. Concerns have been raised about what 
information should be protected from KDD-type access. The 
ethical and moral issues of invasion of privacy are intrinsically 
connected to pattern recognition. Safeguards are being 
discussed to prevent misuses of the technology. 

Summary 

Knowledge Discovery in Databases is answering a need to 
make use of the mountains of data that are accumulating daily. 
KDD enlists the power of computers to assist in 
recognizing patterns in data, a task that exceeds human 
ability as the size of data warehouses increases. New methods 
of analysis and pattern extraction are being developed and 
adapted to KDD. Which method is used depends on the 
domain and results expected. The accuracy of the recorded 
data must not be overlooked during the KDD process. Domain 
specific knowledge assists with the subjective analysis of 
KDD results. Much attention has been given to the data 
mining phase of KDD but earlier steps, such as data cleaning, 
play a significant role in the validity of the results. 

The potential benefits of discovery driven data mining 
techniques in extracting valuable information from large 
complex databases are unlimited. Successful applications are 
surfacing in industries and areas where data retrieval is 
outpacing man's ability to effectively analyze its content. 
Users must be aware of the potential moral conflicts in using 
sensitive information. 

ABSTRACT 

While knowledge management (KM) is becoming 
an established discipline with many applications 
and techniques, its adoption in health care has 
been challenging. Yet the health care sector 
relies heavily on knowledge, and evidence-based 
medicine is expected to be implemented in daily 
health care activities; moreover, delivery of care 
relies on the cooperation of several partners that 
need to exchange their knowledge in order to 
provide quality care. This publication will give 
an overview of KM, its methods and techniques. 
Keywords 

Knowledge Management, Data Mining, Knowledge 
Mining 

1. INTRODUCTION 

In service-based companies, knowledge is a central 
intangible asset; knowledge management deals with 
the creation, use, reuse and dissemination of 
knowledge. Knowledge Management (KM) became 
a discipline during the 1980s, and the growing role of 
information technologies enabled the development 
of efficient KM tools using databases and 
collaborative software. 

KNOWLEDGE MANAGEMENT 

A Brief History 

Knowledge management has always been a central 
question in human societies. Indeed, its roots are to 
be found in the early history of human societies. 
Philosophers, Western as well as Eastern, have 
focused their attention on the question of 
knowledge; already in ancient Greece, ‘scientific’ 
discussions often led to philosophical debates, 
especially on the concept of knowledge. The creation of 
epistemology has finally formalized the question of 
knowledge; indeed, epistemology addresses primarily 
the question of “what is knowledge?” and discusses its 
creation and adoption. In the current discipline of 
knowledge management, philosophical considerations 
from several schools are taken into account, especially in 
the ontological knowledge management field (Grenon, 
2003). 

On the other hand, practical knowledge management 
has always taken place in society, and the transmission 
of knowledge was closely related to technical 
progress. Beginning in the Middle Ages, knowledge 
transmission occurred under what was called 
“Wandergesellen” in Germany and “Compagnonnage” 
in France, where craftsmen and artisans toured the 
country for six months or a year to learn from several 
masters. This was one of the first structured 
methodologies for tacit knowledge transmission. 
Knowledge first spread orally, then in writing; but it was 
restricted to a small circle of educated people until the 
development of printing. While the first printed works served 
religious and literary purposes, technical and 
specialized books began to spread after the wide 
adoption of the printing press. 

In the 20th century, management as well as cognitive 
sciences and psychology led to today’s Knowledge 
Management (KM) (Wiig, 1999). The current form 
of KM emerged in the 1980s with the wide use of 
information technologies in companies; the focus was on 
the intangible asset that knowledge represents. The term 
KM itself appeared in the 1980s and the academic 
discipline was created in 1995 (Stankosky, 2005). 

Goals and challenges of KM are many; for 
companies, KM should increase their performance, 
help to develop partnerships, evaluate risks, 
organize management and enhance their economic 
value. Development of corporate memory and 
measurement tools also aims at assessing intangible 
assets in the companies. Besides, knowledge transfer 
enhancement and companies’ performance 
evaluations became issues of major importance. 
After twenty productive years in KM, the first 
criticisms appeared in 2002. T.D. Wilson (Wilson, 
2002) questioned the foundation of KM, mainly 
because of the difficulty of distinguishing information 
from knowledge in most KM theories. He drew the 
conclusion that KM was a management fad and 
would disappear in the upcoming years. In fact, 
KM survived those criticisms well, even if the 
precision of its vocabulary is not comparable to that 
used in epistemology or in computer-science-based 
KM; the reason probably lies in the real need 
for companies as well as public organizations to use 
KM methods. 

We can distinguish two main KM trends: one centered 
on people and information management, the other on 
information technology. We should also 
recognize two other main orientations: the first is the 
need for evaluation in terms of performance 
measurement, and the second is the measurement of 
knowledge assets in order to evaluate the value of an 
organization (Wiig, 1999). 

KM Frameworks 

Frameworks for KM support are based on 
considerations related to the structure of knowledge 
and to the structure of organizations where the 
frameworks are applied. In most models, 
knowledge types are determined based on different 
criteria, such as having structured or unstructured 
knowledge, and having tacit or explicit knowledge. 

First we have to make a distinction between 
high-level frameworks and implementation-oriented ones. 
The latter focus on how to implement KM 
in an organization, whereas the former discuss the 
question of “what is KM” (Wong & Aspinwall, 
2004). As our purpose is to focus more on the “how 
to” question, we will focus in the next paragraphs on 
the implementation-oriented frameworks. 

High-level frameworks discuss how to fill the 
gap between theory and practice; that is the case of 
Knowledge Creation Frameworks, for example 
(Siebert, 2005). 

Nonaka and Takeuchi (Nonaka & Takeuchi, 1995) 
depict steps to create knowledge in practice that go 
from perception to representation and from tacit 
knowledge to explicit knowledge; they also show how those 
steps can enhance a company’s efficiency. 


Concerning implementation frameworks, Sunassee and Sewry 
(Sunassee & Sewry, 2002) defined three categories of 
frameworks: prescriptive, descriptive and hybrid. 

Prescriptive frameworks give direction concerning the 
procedures that should be used, without describing precisely 
their content or implementation; for example, step-approach 
frameworks are mainly prescriptive frameworks (Wong & 
Aspinwall, 2004). Descriptive frameworks describe the key 
factors of KM that can drive KM initiatives to success or to 
failure; their forms of representation are mostly graphical 
(Wong & Aspinwall, 2004). Examples of descriptive 
frameworks can be found in (Gore & Gore, 1999; 
Holsapple & Joshi, 2002). Finally, 
hybrid approaches combine both prescriptive and 
descriptive methods. 

It is important to find a way to compare KM 
frameworks; however, frameworks are dedicated to 
specific applications, which makes their comparison 
complicated. Wong and Aspinwall (Wong & 
Aspinwall, 2004) proposed a comparison method for 
frameworks based on four elements: their structure, 
the knowledge types they represent, the KM 
processes and the KM influences or factors. 

Methods and Techniques in KM 

We can categorize the methods and techniques in 
KM into three groups: people and technology, 
requirements elicitation and value measurement. 

People and Technology 


Early KM frameworks in the early 
1990s mainly focused on the structural organization 
and IT solutions to improve knowledge 
management (Wiig, 1999). Those methods were 
suited to slow-moving businesses where goals and 
technical solutions are perfectly identified and the 
market does not evolve quickly. However, these 
approaches were not suited to the subsequent fast-moving 
business environment, where new 
challenges started to arise as fast as they disappeared. 


Human-centered KM was identified early and 
became a new school of thought in the early 1990s. 
Peters (Peters, 1994) wrote “the answer turns out to lie 
more with psychology and marketing of knowledge 
within the family than with bits and bytes”. Nowadays 
frameworks take both human and technical perspectives 
into account. We will discuss both approaches 
separately and show how both are integrated in 

Human Perspective: Motivation and Adoption 

The main issue for companies is to motivate 
employees to use KM systems. Not only does the 
technology matter, but people’s involvement in KM 
initiatives is a key factor for their success. Without 
incentives, employees are not ready to share their 
knowledge; therefore, numerous solutions have been 
proposed to motivate employees to make use of KM 
systems. Some companies provide financial 
incentives (monetary rewards) or non-financial 
incentives (air miles, days off) for the first users of 
the KM system. Incentives, financial or not, are 
particularly effective in organizations where holding 
knowledge is often considered a source of power. 
In addition to individual incentives, Zand (Zand, 
1997) suggests a collaborative win-win reward 
system, in which a gain for an individual can be a 
gain for his peers, in contrast to the classical win-lose 
reward system. 

It has also been recognized that higher management 
should use the system too; Liebowitz (Liebowitz, 
1999) cites the success of the KM network of 
Buckman Labs, which was mostly a result of the 
high-level involvement of the senior management and 
especially the CEO. 

The second motivation-related issue is knowledge 
adoption; a recurring challenge is that people are 
not ready to use or apply knowledge developed by 
others. Sussman and Siegal (Sussman & Siegal, 
2003) built a theoretical model to understand the 
underlying issues of knowledge adoption; their 
study discussed the role of informational influence 
in the process of knowledge adoption, and showed 
the importance of the source credibility to convince 
people of the usefulness of the acquired knowledge. 
Once again, the commitment of senior management, 
who are trusted in their organization, can have a 
huge influence on the success of a project. 

Technical Perspective: Data Mining, Inference 
Engine and Multi-Agent Systems 

KM tools deal with explicit knowledge, meaning 
knowledge that can be recorded on a medium, which is 
mainly an electronic one. Historically, collaborative 
tools, such as Lotus Notes, were developed in 
the 1990s to enhance KM. Recent corporate tools 
have widely adopted Web 2.0 technologies such as wiki 
platforms, semantic widgets, tagging and so on. 
Several concepts from broader computer science 
research, such as data mining, rule-based 
reasoning, and multi-agent systems, have been 
integrated in KM solutions; the integration of those 
tools depends on the processes in action. 


For instance, computer-assisted Knowledge Discovery is 
mainly based on data mining techniques. A brief look at 
the papers of the Knowledge Discovery and Data Mining 
(KDD) conference (Li, Liu, & Sarawagi, 2008), the 
major conference on Knowledge Discovery, gives an 
overview of the overwhelming presence of data mining 
within Knowledge Discovery. 

On the other hand, knowledge representation uses 
ontological models, whose use has been encouraged by the 
development of powerful inference engines. These 
representations can be used to infer new knowledge from 
existing knowledge and to support Knowledge Discovery 
processes. Several KM frameworks are based on ontologies 
(Fensel, 2002; Stojanovic, 2003; Sure, 2002; S.-Y. Yang, 
Lin, Lin, Cheng, & Soo, 2005), since a high-level 
representation of knowledge using ontologies enables 
powerful queries as well as knowledge manipulation, 
retrieval and discovery. 
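
To make the idea of inferring knowledge from existing knowledge concrete, the following is a minimal sketch of forward-chaining inference over subject-predicate-object triples. It is not taken from any of the cited frameworks: the two rules, the predicate names (`subClassOf`, `type`) and the sample facts are illustrative assumptions, and a real inference engine supports a far richer rule language.

```python
def infer(facts, rules):
    """Forward chaining: apply every rule until no new fact is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # Each rule returns a freshly built set, so adding to `known` is safe.
            for new_fact in rule(known):
                if new_fact not in known:
                    known.add(new_fact)
                    changed = True
    return known


def subclass_transitivity(known):
    # (A subClassOf B) and (B subClassOf C)  =>  (A subClassOf C)
    subs = [(s, o) for (s, p, o) in known if p == "subClassOf"]
    return {(a, "subClassOf", d) for (a, b) in subs for (c, d) in subs if b == c}


def type_inheritance(known):
    # (x type A) and (A subClassOf B)  =>  (x type B)
    types = [(s, o) for (s, p, o) in known if p == "type"]
    subs = [(s, o) for (s, p, o) in known if p == "subClassOf"]
    return {(x, "type", b) for (x, a) in types for (a2, b) in subs if a == a2}


# Hypothetical facts, chosen only for illustration.
facts = {
    ("Cardiologist", "subClassOf", "Physician"),
    ("Physician", "subClassOf", "CareProvider"),
    ("dr_lee", "type", "Cardiologist"),
}
closure = infer(facts, [subclass_transitivity, type_inheritance])
```

In this sketch, the fact that `dr_lee` is a `CareProvider` is never stated explicitly; it is derived from the class hierarchy, which is the kind of query an ontology-backed KM framework can answer.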

Finally, the multi-agent system (MAS) paradigm is well 
suited to modelling the distribution of knowledge across 
autonomous entities; it is therefore used to disseminate 
knowledge among employees in organizations. MASs also 
take into account reactivity (adaptation to changes in an 
environment) and proactivity (anticipating user needs and 
taking initiatives accordingly). These two factors are 
key to the success of a KM project; indeed, KM 
initiatives must adapt quickly and be able to handle user 
needs. In this context, Virtual Knowledge Communities 
(Maret, Subercaze, & Calmet, 2008) offer an efficient way 
to model KM in organizations, since they integrate the 
MAS approach and the ontological representation of 
knowledge. The Virtual Knowledge Communities model has 
been used for business (Subercaze, Pawar, Maret, & 
Calmet, 2008) as well as for health care purposes (El 
Morr, Subercaze, Maret, & Rioux, 2008). 

Requirements Elicitation 

Requirements can be seen from two angles, a 
technological one and a human-centered one. 

From the technological standpoint, storage for the 
Electronic Knowledge Repository represented a challenge 
in the early stages of KM; indeed, the hardware needed to 
process a huge amount of data can require a significant 
investment. Knowledge Discovery processes also require 
high computational power. With the reduction of hardware 
costs, storage is no longer a critical issue, but the 
latest research using ontological representation, 
inference engines and data mining techniques still 
requires substantial computational power (Guo, Pan, & 
Heflin, 2005). 
Value Measurement 

Assessing the value of KM is a primary concern 
for organizations. As with other intangible assets, 
the reliability of Knowledge Management 
measurement in an organization is subject to 
debate. As underlined in a study for the 
European Union (Zambon, 2003), internal 
evaluations based on information provided by 
managers may be biased and tend to 
overestimate the value of KM. On the other 
hand, evaluations conducted by third 
parties may be imprecise, as third parties may not 
have access to the internal knowledge assets. The 
absence of a market for intangible assets can also 
be a source of evaluation bias; indeed, knowledge as 
an intangible asset is evaluated and appears in the 
financial report, but it cannot be sold and has no 
proper market value. Therefore, there is no market 
structure that can regulate knowledge evaluation. 

Several methods have been developed to estimate 
the value of knowledge in an organization. Skandia 
was the first company to deal with Intellectual 
Capital (IC) measurement (N. Bontis, 1996). It 
defined Intellectual Capital as the sum of human 
capital and structural capital. Human capital 
combines the abilities, knowledge and innovation 
potential of the company’s employees; it also 
includes the company’s philosophy and culture. This 
kind of capital is not the property of the company, 
but the company derives benefits from it. Structural 
capital consists of the patents, trademarks, 
hardware and “everything that gets left behind when 
employees go home” (Nick Bontis, 2001). The IC 
reports developed by Skandia used 36 metrics to give 
a monetary value to an organization; the metrics 
include customer satisfaction, employee 
satisfaction, number of patents and annual turnover. 
Second-generation methods such as the IC-index 
extended the Skandia IC metrics by trying to merge 
the different Skandia indicators into a single index 
(Roos, Roos, Edvinsson, & Dragonetti, 1997). Other 
metrics were developed to evaluate Knowledge 
Management Systems (KMS); Kankanhalli and Tan (2004) 
present a thorough review of KMS metrics. 
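
As an illustration of what merging several indicators into a single index can look like, here is a minimal sketch that computes a weighted average of normalized indicators. It is not the published IC-index formula: the indicator names, their values (assumed to be already scaled to the range 0 to 1) and the weights are all hypothetical.

```python
def composite_index(indicators, weights):
    """Weighted average of indicator values that are already scaled to [0, 1]."""
    total_weight = sum(weights[name] for name in indicators)
    weighted_sum = sum(indicators[name] * weights[name] for name in indicators)
    return weighted_sum / total_weight


# Hypothetical indicator values and weights, for illustration only.
indicators = {
    "customer_satisfaction": 0.8,
    "employee_satisfaction": 0.6,
    "patents": 0.5,
}
weights = {
    "customer_satisfaction": 2.0,
    "employee_satisfaction": 1.0,
    "patents": 1.0,
}
index = composite_index(indicators, weights)
```

The choice of weights is where the debate about measurement reliability reappears: two evaluators using different weights will obtain different index values from the same indicators.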
ABSTRACT 

Knowledge management (KM) is becoming an 
established discipline with many applications and 
techniques, yet its application in health care has 
been challenging. The health care sector 
nevertheless relies heavily on knowledge, and 
evidence-based medicine is expected to be 
implemented in daily health care activities; 
moreover, the delivery of care relies on the 
cooperation of several partners who need to 
exchange their knowledge in order to provide 
quality care. This publication gives a perspective 
view of KM in health care. 
1. INTRODUCTION 

In service-based companies, knowledge is a central 
intangible asset; knowledge management deals with 
the creation, use, reuse and dissemination of 
knowledge. Knowledge Management (KM) became 
a discipline during the 1980s, and the growing role 
of information technologies enabled the development 
of efficient KM tools using databases and 
collaborative software. 

Besides the current roles of knowledge management 
in the health care sector, a few perspectives 
present an opportunity to develop new health care 
KM applications. These perspectives are virtual 
communities, mobility, the Electronic Health Record 
(E.H.R.) and public health. 


Virtual Communities 

“Virtual” health care providers from different 
disciplines (e.g. medicine, nursing, social work, 
physical therapy) can form teams in which they 
combine their knowledge and expertise to provide a 
comprehensive plan of care. It is essential, though, 
to include patients in virtual health care teams; 
indeed, patients must be well informed about their 
conditions, their treatment options and how to 
access them, and must be actively involved in their 
treatment (Davis, Wagner, & Groves, 2000). Health 
Virtual Communities (Health VCs) that include 
caregivers and patients can be used to create, 
manage and coordinate virtual medical teams 
(Pitsillides, et al., 2004). 

Once a Health VC is in place, new knowledge 
emerges through social interactions (Ahmad, 
Kausar, & David, 2007). Patients have tacit 
knowledge about their medical condition and the 
way they experience it. This tacit knowledge 
constitutes a mine of information for clinical 
practice; indeed, it gives insight into the 
patient's experience and hence makes it possible to 
assess his or her quality of life as well as the 
impact of a drug on a person’s life. In this 
respect, virtual communities constitute an 
opportunity for a holistic approach to clinical 
practice. 

Besides, Health VCs constitute an opportunity 
for e-continuing education. In health care, 
continuing education is essential; some 
professionals cannot continue practising unless they 
take a yearly continuing education course in 
order to update their knowledge. In this context, 
knowledge-based Health VCs can play a major role 
by providing a platform for e-education and 
knowledge exchange between peers. The creation of 
a virtual network of experts opens the way to 
testing new kinds of cooperation paradigms and 
peer-to-peer e-educational paradigms (van Dijk, 
Hugenholtz, A-Tjak, & Schreinemakers, 2006). 

Mobility 

While managing knowledge will become an 
important daily practice, the future will be more 
mobile. We are already witnessing the explosion of 
mobile interactive devices and mobile health 
facilities, and the proliferation of e-homecare 
solutions (Hubert, 2006). Mobile knowledge 
management is the next step in mobile health care 
situations where the patient is away from the point 
of care (O’Sullivan, McLoughlin, Bertolotto, & 
Wilson, 2007). The mobility approach is highly 
relevant to virtual communities (El Morr, 2007; 
Christo El Morr & Jalal Kawash, 2007; C. El Morr & 
J. Kawash, 2007); consequently, the creation of 
mobile Health VCs, where knowledge is generated, 
disseminated and shared by both patients and 
caregivers, is a next step that can benefit both 
groups (Hubert, 2006; Moreno & Isem, 2002; Siau & 
Shen, 2006). 

Electronic Health Record (E.H.R.) 

Worldwide, governments are striving to build 
nationwide E.H.R. systems. There has been 
progress in this direction, mainly in developed 
countries. Once health records are computerized, the 
need will be to reach the right information about a 
patient at the right time, and to use the E.H.R. data 
for diagnosis, for personal health decision 
support, for public health decision support and for 
research. However, much of what has been done so far 
in E.H.R. mainly involves data processing (Van 
Vlymen, De Lusignan, Hague, Chan, & Dzregah, 2005); 
besides, health service managers face many 
difficulties when trying to access relevant data 
routinely for quality improvement (De Lusignan, 
Wells, Shaw, Rowlands, & Crilly, 2005). KM 
techniques can play two roles here, one for 
practitioners and one for managers; indeed, KM 
techniques can help in searching for knowledge in 
the mass of data gathered, helping practitioners to 
find more effective ways to treat patients by 
searching for similar patient case histories 
(O’Sullivan, et al., 2007), and helping managers to 
obtain relevant knowledge for total quality 
management (TQM) (McAdam & Leonard, 2001). 
Establishing electronic health records constitutes, 
per se, only a first step; using the mass of data 
gathered to support practitioners in generating 
knowledge and providing quality practice is the 
challenge ahead. 

Evidence-Based Public Health 

Networks for health care surveillance continue to evolve 
(Health Canada, 1999); nevertheless, studies show that 
information and communication technologies are less 
used in public health than in other sectors of society 
(Goddard, et al., 2004; Revere, et al., 2007). Public 
health is traditionally oriented towards data processing 
and data analysis, though there is growing awareness that 
a shift is needed in public health from data-driven 
decision making to knowledge-driven decision making, or, 
to put it in the words of Goddard et al., to “provide 
direct guidance on the relative effectiveness of 
different interventions in a specific situation” 
(Goddard, et al., 2004). KM can play a vital role in 
organizing, structuring and supporting evidence-based 
public health decision making (Andreas & Nicholas, 2006; 
Revere & Fuller, 2008). In this context, research needs 
to reveal how the public health community communicates 
and cooperates, particularly in terms of roles, 
communication strategies, artifacts used, etc. Different 
profiles of health care knowledge workers can then be 
sketched. Research methods from the Computer Supported 
Collaborative Work (CSCW) field can be used. Findings can 
then be integrated in the context of Communities of 
Practice, where knowledge tools can further knowledge 
creation, communication and transfer. The medical field 
is experiencing a move to evidence-based medicine; a 
similar move to evidence-based public health is important 
and would be strategic for an advanced management of 
population health. KM can play a vital role in making 
this move. 
Knowledge Transfer 

Knowledge transfer is concerned with the 
dissemination of knowledge, connecting and adapting 
research findings to society's needs. Increasingly, 
the role of the knowledge broker is recognized as 
vital in knowledge transfer (Lind & Persborn, 2000); 
knowledge brokering “links decision makers with 
researchers, facilitating their interaction” 
(Canadian Health Services Research Foundation, 2003; 
Lomas, 2007). In this context, there is a crucial 
need to understand how knowledge is transferred, and 
how it is transformed while it is transferred (Syed, 
1999); cognitive theory can be of much help in this 
domain. This understanding will help provide 
feedback to knowledge generators (i.e. researchers) 
and widen their knowledge (i.e. help generate more 
knowledge) (Figure 1). 

Health 2.0 and Semantic Web 

The term health 2.0 embeds the concepts of 
healthcare, web 2.0 and e-health. Following the 
web 2.0 principles, health 2.0 is driven by 
participatory ideas. In health 2.0, every actor 
of the system, including patients and other 
stakeholders, is involved in improving the 
health care system using existing web 2.0 
social networking, semantic web and 
collaborative tools. 

Like web 2.0, health 2.0 is an imprecise 
term. Most of the applications focus on 
enhancing communication within the community. 

For instance, Sermo is a physicians' community 
dedicated to information exchange and 
collaboration, and DoubleCheckMD is a 
patient-oriented application dedicated to drug 
side effects; Vitals helps patients find a 
relevant doctor matching search criteria and 
write reviews of doctors, and PatientsLikeMe 
is an online community for patients with 
life-threatening conditions. It allows 
patients to share their experience and medical 
data with other patients, doctors and research 
organizations; it aims at establishing 
data-sharing partnerships. 


The trend in health 2.0 is to enhance collaborations, 
either between physicians or between patients, and 
to create new relationships between patients and 
doctors and research organizations. 

On the other hand, semantic Web technologies enable a 
next step in accessing data at the scale of the web; 
indeed, RDF and OWL technologies are being used for 
knowledge modeling and for large database integrations. 
Currently, the W3C Semantic Web in Health Care and Life 
Sciences Interest Group (HCLSIG) aims at offering better 
access to information from many domains and processes, 
for efficient decision support and disease management. 
Initiatives like OBI (Ontology for Biomedical 
Investigations) and the RNA Ontology Consortium are the 
results of the movement initiated by the HCLSIG. 

While the current health 2.0 applications are based on 
relational databases, we believe that in the near future 
we will see a merger between health 2.0 and the 
semantic web technologies developed by HCLSIG. The 
resulting applications could considerably improve 
automated knowledge management related to healthcare. 

CONCLUSION 

Knowledge management in health care is progressing; 
the complexity and challenges facing the health care 
sector can be addressed by adopting KM strategies. 
The use of KM in health care promises to enhance 
the quality of care for patients by providing them with 
continuity of care. The implementation of health care 
KM systems will allow health care partners (e.g. 
practitioners, administrators) to conduct 
evidence-based practice and to collaborate relying on the 
best knowledge available. 

This is a challenge that opens the way to more 
innovations in both KM and health. The current state of 
KM in health care can be improved; we believe that new 
practices such as health 2.0 applications, VCs and 
evidence-based medicine will help to increase the overall 
quality of care for patients as well as the efficiency of 
KM in healthcare. 
ABSTRACT 

Robotics and artificial intelligence serve very different 
purposes. However, people often get them mixed up. Robotics 
is a branch of technology which deals with robots. Robots are 
programmable machines which are usually able to carry out a 
series of actions autonomously, or semi-autonomously. 
Artificial intelligence (AI) is a branch of computer science. It 
involves developing computer programs to complete tasks 
which would otherwise require human intelligence. 

Keywords: Robotics, artificial intelligence, algorithm. 


Introduction 

Robotics and artificial intelligence serve very different purposes. 
However, people often get them mixed up. 

Are Robotics and Artificial Intelligence the Same Thing? The first 
thing to clarify is that robotics and artificial intelligence are not the 
same thing at all. In fact, the two fields are almost entirely separate. 
A Venn diagram of the two would look like this: 



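[Figure: Venn diagram of Robotics and Artificial Intelligence; the overlapping region is labelled "Artificially intelligent robots".] 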


I guess that people sometimes confuse the two because of the overlap 
between them: Artificially Intelligent Robots. 

To understand how these three terms relate to each other, let's look at 
each of them individually. 

What Is Robotics? 

Robotics is a branch of technology which deals with robots. Robots 
are programmable machines which are usually able to carry out a 
series of actions autonomously, or semi-autonomously. 

In my opinion, there are three important factors which constitute a 
robot: 

1. Robots interact with the physical world via sensors and 
actuators. 

2. Robots are programmable. 

3. Robots are usually autonomous or semi-autonomous. 


Robots are "usually" autonomous because some robots aren't. 
Telerobots, for example, are entirely controlled by a human operator 
but telerobotics is still classed as a branch of robotics. This is one 
example where the definition of robotics is not very clear. 

It is surprisingly difficult to get experts to agree exactly what 
constitutes a "robot." Some people say that a robot must be able to 
"think" and make decisions. However, there is no standard definition 
of "robot thinking." Requiring a robot to "think" suggests that it has 
some level of artificial intelligence. 

However you choose to define a robot, robotics involves designing, 
building and programming physical robots. Only a small part of it 
involves artificial intelligence. 
What Is Artificial Intelligence? 

Artificial intelligence (AI) is a branch of computer science. It 
involves developing computer programs to complete tasks which 
would otherwise require human intelligence. AI algorithms can tackle 
learning, perception, problem-solving, language-understanding and/or 
logical reasoning. 

AI is used in many ways within the modern world. For example, AI 
algorithms are used in Google searches, Amazon's recommendation 
engine and SatNav route finders. Most AI programs are not used to 
control robots. 

Even when AI is used to control robots, the AI algorithms are only 
part of the larger robotic system, which also includes sensors, 
actuators and non-AI programming. 

Often — but not always — AI involves some level of machine 
learning, where an algorithm is "trained" to respond to a particular 
input in a certain way by using known inputs and outputs. We discuss 
machine learning in our article Robot Vision vs Computer Vision: 
What's the Difference? 

The key aspect that differentiates AI from more conventional 
programming is the word "intelligence." Non-AI programs simply 
carry out a defined sequence of instructions. AI programs mimic 
some level of human intelligence. 

What Are Artificially Intelligent Robots? 

Artificially intelligent robots are the bridge between robotics and AI. 
These are robots which are controlled by AI programs. 

Many robots are not artificially intelligent. Up until quite recently, all 
industrial robots could only be programmed to carry out a repetitive 
series of movements. As we have discussed, repetitive movements do 
not require artificial intelligence. 

Non-intelligent robots are quite limited in their functionality. AI 
algorithms are often necessary to allow the robot to perform more 
complex tasks. 

Let's look at some examples. 

Example: Non-Artificially Intelligent Cobot 

A simple collaborative robot (cobot) is a perfect example of a non- 
intelligent robot. 

For example, you can easily program a cobot to pick up an object and 
place it elsewhere. The cobot will then continue to pick and place 
objects in exactly the same way until you turn it off. This is an 
autonomous function because the robot does not require any human 
input after it has been programmed. However, the task does not 
require any intelligence. 

Example: Artificially Intelligent Cobot 

You could extend the capabilities of the cobot by using AI. 


Imagine you wanted to add a camera to your cobot. Robot vision 
comes under the category of "perception" and usually requires AI 
algorithms. 

For example, say you wanted the cobot to detect the object it was 
picking up and place it in a different location depending on the type 
of object. This would involve training a specialized vision program to 
recognize the different types of object. One way to do this is using an 
AI algorithm called Template Matching, which we discuss in our 
article How Template Matching Works in Robot Vision. 
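
To give a feel for the technique, here is a minimal sketch of template matching on a tiny grayscale image stored as a list of rows. It slides the template over every position and keeps the one with the smallest sum of squared differences. The similarity measure, the function name `match_template` and the sample pixel values are assumptions made for illustration; production vision libraries offer faster and more robust variants.

```python
def match_template(image, template):
    """Return ((row, col), score) of the best match by sum of squared differences."""
    image_h, image_w = len(image), len(image[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    best_pos, best_score = None, float("inf")
    # Try every position where the template fits entirely inside the image.
    for r in range(image_h - tmpl_h + 1):
        for c in range(image_w - tmpl_w + 1):
            score = sum(
                (image[r + dr][c + dc] - template[dr][dc]) ** 2
                for dr in range(tmpl_h)
                for dc in range(tmpl_w)
            )
            if score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos, best_score


# Hypothetical 4x5 image containing one bright 2x2 object.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 9, 8, 0],
    [0, 0, 7, 9, 0],
    [0, 0, 0, 0, 0],
]
template = [
    [9, 8],
    [7, 9],
]
position, score = match_template(image, template)
```

A cobot could keep one template per object type and route each detected object according to whichever template gives the lowest score.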

Conclusion 

As you can see, robotics and artificial intelligence are really two 
separate things. Robotics involves building robots whereas AI 
involves programming intelligence. 

"Software robot" is the term given to a type of computer program 
which operates autonomously to complete a virtual task. Software 
robots are not physical robots, as they only exist within a computer. 
The classic example is a search engine webcrawler which roams the 
internet, scanning websites and categorizing them for search. Some 
advanced software robots may even include AI algorithms. However, 
software robots are not part of robotics. 
Abstract: Resource planning in a university is a very hard 
management science problem. Faculty members are an expensive 
resource that a university needs to utilize efficiently and 
deploy effectively for the courses that they can teach. In this 
paper, we focus on one of the most important problems in 
universities, the academic calendar, which comprises 
faculty-course assignment, course scheduling and timetabling. We 
propose an innovative two-step approach that solves the problem 
using mathematical models to optimize resource allocation 
while satisfying faculty preferences. Using a real-world example, 
we also show how the problem is solved easily and how the solution 
improves the productivity of the staff and enhances the 
satisfaction of faculty. 

Keywords: faculty assignment, timetable, university, minimize 
cost, class schedule 


I. Introduction 

Universities today are very complex and serve 
thousands of students annually, ranging from undergraduates 
and post-graduates to executive training programmes. The mission 
of the university in this study is to provide lifelong education, 
equipping learners to serve society in a positive manner. For 
universities, faculty members (or “faculty”) are the most 
expensive resource. Faculty are involved in course development, 
assessment writing, course delivery and grading. In 
addition, they need to support other administrative tasks 
such as student recruitment, information sharing sessions, applied 
project supervision and industry relations. Some of them 
focus on research studies that require collaboration with 
industry as an integral part of their duties, in order to keep in 
touch with industry, understand market needs and enhance the 
quality of classroom teaching. In the meantime, if there is an 
internship opportunity from industry, students attend 
interviews and take up either a short-term internship in the 
company, of say ten weeks, or a long-term one of six months. 

There are two semesters in the university: one runs from 
August to December and is referred to as “term 1”, and the other 
runs from January to June and is referred to as “term 2”. Before 
the start of each semester, the administrative team needs to 
forecast the demand for each course and determine the required 
number of sessions for each course. A course is a series of classes 
on a particular subject. A faculty member may teach one session of 
a course or multiple sessions of the same course. The university 
offers some courses only once a year and others in every term. Some 
of the courses are foundation courses, and the demand for such 
courses is therefore relatively higher than for elective courses, 
which may have only a single session. We have a minimum class size 
of 20 and a maximum class size of 50. If the demand falls below 20, 
the university does not offer the course, to avoid sub-optimal use 
of resources. Because we also focus on class interactivity, every 
class is limited to between 20 and 50 students. 
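
The class-size rule can be turned into a small calculation of how many sessions a forecast demand requires. The sketch below is only an illustration under stated assumptions: the limits 20 and 50 come from the text, while the function names and the policy of splitting students as evenly as possible across sessions are hypothetical and not described by the authors.

```python
import math

MIN_CLASS_SIZE = 20  # below this demand the course is not offered
MAX_CLASS_SIZE = 50  # upper limit per session


def sessions_required(demand):
    """Smallest number of sessions that keeps every session within the maximum."""
    if demand < MIN_CLASS_SIZE:
        return 0  # course is not offered
    return math.ceil(demand / MAX_CLASS_SIZE)


def split_evenly(demand):
    """Assumed policy: spread students as evenly as possible over the sessions."""
    n = sessions_required(demand)
    if n == 0:
        return []
    base, extra = divmod(demand, n)
    return [base + 1] * extra + [base] * (n - extra)
```

For example, a forecast of 120 students gives three sessions of 40, a forecast of 51 gives sessions of 26 and 25, and a forecast of 15 gives no session at all. With an even split, every session stays at or above 20 whenever the course is offered.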

Faculty includes full-time employees of the university who 
hold a Ph.D. or an equivalent doctoral degree. The faculty 
assignment problem is non-trivial because the faculty are not 
homogeneous. They can teach only courses in their area of 
expertise or field of study, since a mismatch of skills and 
knowledge is not ideal for either faculty or students. Some 
professionals or instructors from other institutions or industry 
who are associated with the university (e.g., by teaching some 
courses or supervising students) but do not hold professorships may 
be appointed as adjunct faculty or associates. There is a large 
pool of associates who teach courses in their area of expertise. 
The university deploys them for courses that cannot be covered by 
the full-time faculty. In any case, the associates are part-time 
staff on contract, and they are not entitled to the medical and HR 
benefits that full-time faculty receive. 

The faculty-course allocation (i.e. faculty assignment) is an 
important decision-making process for the associate dean of the 
school, who, together with his administrative team, needs to make 
sure that all course demands are met and that all courses are 
assigned to faculty members (full-time and part-time) who can teach 
them well. The team also needs to ensure that all full-time 
faculty members are fully utilized before engaging the 
associates. This is done manually, based purely on the historical 
teaching record, and most of the time faculty preferences are 
not taken into consideration. This raises a lot of concern and 
dissatisfaction among the faculty members. The administration 
team also needs to schedule the courses into various timeslots 
based on the availability of classrooms. This is the timetabling 
problem. 

The stated problem, faced by the associate dean and the 
administrative team every semester, is currently solved manually. 
The task is mundane and takes many man-hours to complete. We 
therefore propose a decision support system that plans the faculty 
assignment and timetabling automatically, with minimal human 
intervention, while taking operational constraints into 
consideration. 

In the next few sections, we discuss the literature; the 
important decision-making process at one of the private 
universities in Singapore; and a data analysis that identifies the 
issues of using “gut feel” and experience to do resource planning 
at the university. We propose a two-step optimization model for 
faculty assignment and timetabling to meet the required 
demand and run the programme. The approach yields better 
results and improves the faculty satisfaction level, as faculty are 
happy with the course allocation and schedule. We also present 
some computational results and the performance of the models. 
Finally, we conclude by discussing the limitations of the proposed 
model and future work. 

II. Literature Review 

Generating the academic calendar for a university is a very 
challenging and time-consuming task due to diverse demands, 
faculty preferences and availability, limited time slots and 
classroom requirements. Although there are other assignment 
problems in a university, the academic calendar problem is the 
most frequent, and scheduling classes is the most challenging 
one. The academic calendar problem consists of two kinds of 
assignment problems. One is faculty assignment: determining 
which faculty member is assigned to each course based on 
expertise and preference. The other is course scheduling: 
determining the correspondence between classes, timeslots and 
classrooms. The context and complexity of the assignment 
problems depend on the relevant university systems, and 
various approaches to the problem have been proposed by 
researchers [1, 2, 3, 4, 5, 6, 7]. 

Adewumi, Sawyerr, and Ali [2] addressed lecturer scheduling 
at a Nigerian university, using an iterative process to 
generate schedules based on the degree of violation of hard 
constraints. Daskalaki and Birbas [5] developed a two-stage 
procedure for a department providing structured curricula for 
well-defined groups of students. The procedure includes a 
relaxation approach for computationally heavy constraints, and 
sub-problems to obtain timetables for each day of the week. 
Derigs and Jenal [7] described a genetic-algorithm-based 
system for professional course scheduling using strategies such 
as pre-assigning subsets of courses. Dinkel, Mote and 
Venkataramanan [9] used a network-based model considering 
the dimensions of faculty, subject, time and room for the 
College of Business Administration at Texas A&M University. 
Other articles describing heuristic approaches to course 
scheduling in university environments include [4, 6, 8, 10]. 

Fong, Asmuni, McCollum, McMullan and Omatu [11] 
proposed a new hybrid method that combines a great deluge 
algorithm with an artificial bee colony algorithm (INMGD-ABC) 
to solve the university timetabling problem. The artificial bee 
colony algorithm (ABC) is a population-based method that has 
been introduced in recent years and has proven successful in 
solving various optimization problems effectively. However, as 
with many search-based approaches, there are weaknesses in 
its exploration and exploitation abilities which tend to induce 
slow convergence of the overall search process. Therefore, 
hybridization is proposed to compensate for the identified 
weaknesses of the ABC. 

Gunawan, Ng and Poh [12] developed a mathematical model to 
solve teacher assignment and course scheduling for a master 
course. An initial solution is obtained by a mathematical 
programming approach based on Lagrangian relaxation. This 
solution is further improved by a simulated annealing 
algorithm. The proposed method has been tested on instances 
from a university in Indonesia, as well as on several randomly 
generated datasets, and the corresponding computational results 
are reported. 

Hinkin and Thompson [13] considered integrated teacher 
assignment and course scheduling at a university in Indonesia, 
and used a heuristic based on Lagrangian relaxation. The 
models were solved in phases using CPLEX [8]. The authors 
developed a computer program to automate the scheduling 
process, considering conflicts among core required courses, and 
among electives within areas. The program was used by an 
administrator in the student services office. 

Koide [14] developed a prototype system for examination 
proctor assignment at Konan University, with reference to the 
mathematical model in [16]. The study focused on proctor 
assignment, and the target model considered some types of 
constraints on daily workload that differ from the 
constraints in the model of [16]. A mixed integer programming 
model was proposed and an optimal solution was derived through 
CPLEX [8], a commercial optimization software package. The 
resulting assignment seemed acceptable to the registrar staff; 
nevertheless, some additional practical conditions were 
neglected to simplify the mathematical model. The 
study extends the previous model and discusses the usefulness 
of the system for its users in the practical assignment task. 
The timetabling problem is generally large and highly constrained, 
and its solution by exact optimization methods is difficult [15]. 

Onouchi, Uchigaito and Sasaki [16] studied the timetabling 
problem for final examinations in their university. The problem 
was solved in two stages: examination timetabling and classroom 
assignment were conducted in the first stage, and proctor 
assignment in the second stage. The problems in both stages 
were formulated as mixed integer programming (MIP) models and 
solved using commercial optimization software. The authors 
proposed a meta-heuristic approach, as it is more easily 
comprehensible for system users. 

Tim Roughgarden [17] discussed the problem of optimizing the 
performance of a system for the concrete setting of scheduling 
“centrally controlled” jobs and formulated this goal as an 
optimization problem via Stackelberg games, games in which 
one player acts as a leader (here, the centralized authority or 
academic dean interested in optimizing system performance) 
and the rest as followers (the faculty members). The problem is 
then to compute a strategy for the leader (a Stackelberg 
strategy) that induces the followers to react in a way that (at 
least approximately) minimizes the total latency in the system. 

Based on the literature review, we have identified that faculty 
assignment and course scheduling (timetabling) is a very 
complex yet critically important task for a university. Most 
researchers develop mathematical models and solve them using 
search algorithms or heuristics. The 
contribution of our paper is as follows: we develop a two-step 
approach that solves the faculty assignment and timetabling 
problems sequentially and then uses the output from both models 
to obtain the faculty schedule. This approach reduces the 
computation time considerably, as the number of variables is 
greatly reduced. 

III. Problem description 

The university discussed here is a new university that has only 
recently started operating. Each year, for performance 
appraisal, the university asks faculty members to submit their 
preferences (first, second and third choices) in terms of courses to 
teach. Full-time faculty members teach six classes in an 
academic year as per the employment contract; however, they 
may get some reduction in teaching load if they are involved in 
course development or significant administrative work to support 
the programme. Their performance is measured by substantive 
contributions to the learning of their students and to their field, 
as well as by service contributions to their field and the 
university. Therefore, student feedback, along with the 
effectiveness of course delivery, the quality of course development 
and the effectiveness of mentoring students, is essential for their 
career development and for ensuring accountability and 
equity across the faculty. In essence, their performance 
indicators are of two kinds: 1) those that denote scholarly 
activities relevant to a performance area and 2) those that 
provide service support of the quantity and quality of content 
delivery activity in a performance area. Neither the number of 
activities nor the number of supporting services necessarily 
indicates a high (or low) quality of performance; instead, 
a combination of quantitative and qualitative elements is 
considered when evaluating performance. 

The adjunct faculty members are part-time staff, and they need 
to fill in a form to indicate the courses that they are eligible to 
teach. Due to human resource policy and regulations, 
adjunct faculty members teach no more than two classes in a 
year. By the end of term 2, the administrative team has the 
estimated number of students enrolled for each course before the 
new academic year begins in August. Based on the student 
enrollment, they compute the number of sessions required for 
each course. Each class needs a minimum of 20 
students and is capped at 50 due to classroom capacity. If 
120 students sign up for a course, for example, then the 
total number of sessions required is 3, and the team tries to 
balance the number of students in each class. Faculty are assigned 
to courses based on their expertise, and the current objective is to 
ensure that the demand for all courses is fulfilled while taking 
faculty preferences into consideration. It is a multi-faceted 
decision-making process, which requires a system that can 
assign faculty to courses that they are able to teach 
and utilize them effectively. 
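The session rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `plan_sessions` is hypothetical, and returning an empty plan when enrollment is below the minimum class size is our assumption, since the text does not say how such courses are handled.

```python
import math

def plan_sessions(enrolled, min_size=20, cap=50):
    """Split an enrollment into the fewest classes of at most `cap` students,
    balancing class sizes as evenly as possible.

    Assumption (not stated in the paper): a course with fewer than
    `min_size` students gets no session at all.
    """
    if enrolled < min_size:
        return []
    sessions = math.ceil(enrolled / cap)      # fewest classes that respect the cap
    base, extra = divmod(enrolled, sessions)  # spread the remainder one student at a time
    return [base + 1] * extra + [base] * (sessions - extra)

print(plan_sessions(120))
```

For the 120-student example in the text this gives three classes of 40 students each.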

After the faculty assignment, the administrative team works on 
scheduling the classes. There are only fifteen timeslots 
available in a week: three timeslots daily, which we 
refer to as “t1 or morning” for 08:30am to 11:30am, “t2 or 
afternoon” for 12:00pm to 3:00pm and “t3 or evening” 
for 3:30pm to 6:30pm. Since there is a limited number of 
classrooms available in the university, classes are assigned to 
a timeslot based on classroom availability. Therefore, one of 
the objectives is to schedule all the courses to a timeslot 
and to an available classroom. At the end of this process, the 
timetable for all the courses, which denotes the day of the week, 
timeslot and classroom, is available. 

Once courses are allocated and classes scheduled, the faculty 
members who teach each course are assigned to the schedule. If 
a faculty member has a conflict with the allocated schedule, 
the member is allowed to swap internally with another faculty 
member within the course. A faculty member cannot teach 
more than one course in the same timeslot of the day. The final 
output is the faculty teaching schedule. 

The university would like to harness the power of analytics 
to develop a decision support system that automatically 
allocates, assigns and schedules, in that sequential order. The 
proposed solution aids the team in their decision making and 
helps achieve an optimal outcome. 

IV. Modeling & Computational results 

We model the faculty assignment problem as an integer 
programming (IP) model. The objective of the model is to 
minimize the cost. The cost of allocating a course to a faculty 
member is partially based on the preference list. Each faculty 
member is asked to submit three preferred courses each year. The 
model aims to assign courses according to the faculty’s preferences 
as far as possible so as to increase the satisfaction level of the 
staff. We assign a lower cost if the course is in the preference 
list. The lowest cost is assigned to the first preference, followed 
by the second and third choices. For example, we may assign 100 to 
the first choice, 200 to the second choice and 300 to the third 
choice. A higher penalty is applied for assigning a course to a 
faculty member outside of the preference list. It is also more 
expensive to engage an adjunct than a full-time faculty member; 
therefore, we treat them separately from now on. 
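As a minimal sketch of the preference-based part of the cost, the example values 100, 200 and 300 from the text can be encoded as follows. The out-of-list penalty of 1000 and the function name `preference_cost` are assumptions made for illustration only; the text says just that this penalty is higher.

```python
# Example costs taken from the text: 1st, 2nd and 3rd choice.
PREFERENCE_COST = {1: 100, 2: 200, 3: 300}

# Assumed value for illustration; the paper only says the penalty is "higher".
OUTSIDE_LIST_PENALTY = 1000

def preference_cost(course, preferences):
    """Cost of giving `course` to a faculty member whose ranked
    preference list is `preferences` (first choice first)."""
    for rank, preferred in enumerate(preferences[:3], start=1):
        if preferred == course:
            return PREFERENCE_COST[rank]
    return OUTSIDE_LIST_PENALTY

# Preference list of resource R1 from Table 1: C5, C3, C2.
r1 = ["C5", "C3", "C2"]
print([preference_cost(c, r1) for c in ("C5", "C3", "C2", "C7")])
```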

Let i be the index of courses, i = 1, 2, 3, ..., n 

Let j be the index of faculty members, j = 1, 2, 3, ..., m 

Let $x_{ij}$ be the decision variable, where course i is allocated to 
faculty j (the number of sessions of course i taught by faculty j) 

Let $d_i$ be the demand for course i 

Let $f_j$ be the number of sessions that faculty j is required to teach 

Let $c_{ij}$ be the cost of assigning course i to faculty j 


Problem P1 - Faculty assignment 

Objective: $\min \sum_{i=1}^{n} \sum_{j=1}^{m} c_{ij} x_{ij}$    (1) 

$\sum_{j=1}^{m} x_{ij} \ge d_i, \quad \forall i = 1, 2, 3, \ldots, n$    (2) 

$\sum_{i=1}^{n} x_{ij} \le f_j, \quad \forall j = 1, 2, 3, \ldots, m$    (3) 

$x_{ij} \in \mathbb{Z}^{+}$    (4) 


Expression (1) is the objective function: we want to minimize 
the cost of assignment, which translates to meeting the 
preferences of each faculty member as much as possible while 
meeting all the operational constraints. 

Equation (2) states that the number of sessions assigned across 
faculty for each course must be larger than or equal to the demand 
for the course. Since it is a minimization problem, the system 
assigns the minimum number of classes to the faculty. Equation (3) 
states that the number of classes assigned to each faculty member j 
must not exceed the number of teaching sessions required of that 
member. For normal full-time faculty, $f_j$ is 6 and for adjunct 
faculty, $f_j$ is 2. 
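To make the roles of the objective and of constraints (2) and (3) concrete, the following sketch solves a toy instance of P1 by exhaustive enumeration. It is not the Excel Solver or CPLEX procedure used in the paper: the function name `solve_p1`, the 2-course, 2-faculty data and the capacities are illustrative assumptions, and brute force is only workable for very small instances.

```python
from itertools import product

def solve_p1(cost, demand, capacity):
    """Brute-force the faculty assignment IP on a tiny instance.

    cost[i][j]  : cost c_ij of one session of course i taught by faculty j
    demand[i]   : sessions d_i required for course i
    capacity[j] : maximum sessions f_j for faculty j
    Returns (best_cost, x) with x[i][j] = sessions of course i given to j.
    """
    n, m = len(cost), len(cost[0])
    # Each x_ij can take any integer value from 0 up to the capacity of faculty j.
    ranges = [range(capacity[j] + 1) for _ in range(n) for j in range(m)]
    best_cost, best_x = None, None
    for flat in product(*ranges):
        x = [flat[i * m:(i + 1) * m] for i in range(n)]
        # Constraint (2): every course receives at least its demand.
        if any(sum(x[i]) < demand[i] for i in range(n)):
            continue
        # Constraint (3): no faculty member exceeds the teaching load.
        if any(sum(x[i][j] for i in range(n)) > capacity[j] for j in range(m)):
            continue
        total = sum(cost[i][j] * x[i][j] for i in range(n) for j in range(m))
        if best_cost is None or total < best_cost:
            best_cost, best_x = total, [list(row) for row in x]
    return best_cost, best_x

# Toy data (assumed): course 0 is faculty 0's first choice, course 1 is faculty 1's.
cost = [[100, 300],
        [200, 100]]
print(solve_p1(cost, demand=[2, 1], capacity=[2, 2]))
```

On this toy instance both sessions of course 0 go to faculty 0 and the single session of course 1 goes to faculty 1, for a total cost of 300.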

This mathematical model is compact. The challenge that we 
faced during the modeling was to assign an appropriate cost 
$c_{ij}$. We have taken into consideration the faculty’s 
preferences, the faculty’s expertise, as well as the cost of 
conducting a course by full-time and adjunct faculty. The cost of 
conducting a course for full-time faculty also varies according to 
rank. Currently there are three ranks in the university, namely 
assistant professor, associate professor and professor. The 
teaching cost for an assistant professor is the lowest, followed by 
a 30% increase for an associate professor and a 50% increase for a 
professor. These are user inputs, and the users can change them 
according to the university pay structure. 

After this assignment, the team initiates the timetabling process 
by scheduling the courses to timeslots where there are available 
classrooms. There are some basic assumptions for this model. 
The class schedule (timetable) for a course is the same every week 
throughout the semester, which means that if a course BA100 is 
scheduled on Monday 8:30am and the venue is room 4015, then 
the same schedule is valid for each week within the semester. The 
courses can be scheduled in three timeslots (t1, t2, t3) per day 
from Monday to Friday, a total of 15 timeslots, T, in a week. In 
this case, we try to find the best timeslot to schedule classes each 
week according to classroom availability. The schedule for a 
class remains the same for the whole semester, unless a session 
falls on a public holiday, in which case we arrange a make-up class 
on the Saturday of the same week. Since the university is 
expanding, classroom facilities have become a scarce resource. We 
schedule a course at a particular timeslot and in an available 
classroom. Thus, the objective is to schedule all the courses to 
timeslots and available classrooms. After this scheduling process, 
we assign the faculty members who teach the course to the 
timetable. If a faculty member has a conflict with the scheduled 
timetable, we allow them to swap internally with another faculty 
member from the same course. A faculty member cannot teach 
more than one course in the same timeslot of the day; they 
should be allocated different timeslots. A timetable is a 
collection of timeslots for a week, for a course or for a faculty 
member. 

Let i be the index of courses, i = 1, 2, 3, ..., n 
Let t be the index of available timeslots, t = 1, 2, 3, ..., T 
Let $y_{it}$ be the decision variable, where course i is allocated to 
time t 

Let $d_i$ be the demand for course i 

Let $r_t$ be the number of classrooms available in timeslot t 

Let $c_{it}$ be the cost of assigning course i to timeslot t 

Problem P2 - Timetabling 

Objective: $\min \sum_{i=1}^{n} \sum_{t=1}^{T} c_{it} y_{it}$    (5) 

$\sum_{t=1}^{T} y_{it} \ge d_i, \quad \forall i = 1, 2, 3, \ldots, n$    (6) 

$\sum_{i=1}^{n} y_{it} \le r_t, \quad \forall t = 1, 2, 3, \ldots, T$    (7) 

$y_{it} \in \{0, 1\}$    (8) 


Expression (5) is the objective function: we want to minimize 
the cost of assigning course i to timeslot t. If we want to schedule 
more afternoon classes, provided there are classrooms available, 
then we can assign a lower cost to the noon timeslot each day. It is 
generally true that students prefer noon classes to early 
morning or evening classes. 

Equation (6) states that the number of timeslots allocated to 
each course must be larger than or equal to the number of classes 
required for the course. Equation (7) states that the number of 
classes assigned in each timeslot t must not exceed the number of 
classrooms available. Finally, the decision variable $y_{it}$ is a 
binary variable; it is assigned 1 if course i is assigned to time t 
and 0 otherwise. Suppose, for example, that course C1 has 3 classes 
weekly. The weekly class schedule can be t1 (Monday 
morning), t11 (Thursday noon) and t14 (Friday noon). 

Finally, we match the faculty assignment to the course timetable 
using some basic rules to obtain the faculty schedule. We 
ensure that no faculty member teaches two classes in the same 
timeslot; where a clash occurs, we make minor adjustments. 
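The matching rule can be sketched as a simple greedy pass. This is an assumed illustration, not the authors' procedure: the function name `match_faculty` and the data layout are hypothetical, and a greedy pass may fail on inputs where a more careful swap would succeed.

```python
def match_faculty(assignment, course_slots):
    """Give each faculty member timeslots for their assigned sessions,
    never placing one person in the same timeslot twice.

    assignment   : {faculty: {course: number_of_sessions}}
    course_slots : {course: [timeslot, ...]} from the course timetable
    """
    remaining = {course: list(slots) for course, slots in course_slots.items()}
    schedule = {}
    for faculty, courses in assignment.items():
        busy = set()                      # timeslots already taken by this person
        schedule[faculty] = {}
        for course, sessions in courses.items():
            # Skip slots that would clash with this person's other classes.
            picked = [t for t in remaining[course] if t not in busy][:sessions]
            if len(picked) < sessions:
                raise ValueError(f"no clash-free slot left for {faculty} on {course}")
            for t in picked:
                remaining[course].remove(t)
                busy.add(t)
            schedule[faculty][course] = picked
    return schedule

# One faculty member with a C6 session at t8 and a C8 session offered at t8 or t9.
print(match_faculty({"R3": {"C6": 1, "C8": 1}}, {"C6": [8], "C8": [8, 9]}))
```

In this small case the C8 session moves to t9 because t8 is already occupied by C6, which mirrors the kind of adjustment noted for R3 in Table 3.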

The model can be solved very quickly using Excel Solver if the 
problem is not big, or using commercial software such as IBM 
CPLEX Optimizer. We used past university data and ran 
the model for 1 year. For 20 courses, 80 classes and 16 
faculty members, we were able to solve the problem in less than 30 
minutes. This greatly improves the productivity of the 
administrative team, who only need to review the preliminary 
result and make minor adjustments when required. 

We use the following example to illustrate how our approach 
works. Assume we have 10 courses, 40 classes, 5 full-time 
faculty members and 5 part-time adjunct faculty members. The 
faculty have given their preferences, as well as the courses that 
they can teach, in advance. There are 15 timeslots in a week, 
namely t1, t2, ..., t15: t1 represents Monday morning, t2 
represents Monday noon and t15 represents Friday evening. There are 
only three classrooms available to book for each timeslot. What are 
the faculty assignment, the timetable for the courses and the 
faculty timetable? Using the above formulation and Excel Solver, we 
are able to solve the problem in minutes. Some sample output 
from the model is given below. 

This formulation has some limitations. The model may allocate 
many different courses to one faculty member, e.g. four different 
courses in a year, which is neither desirable, as one faculty 
member should not teach more than three different courses in a 
year, nor optimal for the program. However, such an allocation is 
very rare, as the model tries to assign courses within the faculty 
preference list of three choices. 

Assume that we have 10 courses and 10 faculty members to assign 
to 15 timeslots. The number of variables needed to solve everything 
in one model is 10x10x15 = 1500. If we split the problem 
using the two-step approach, the number of variables for 
problem P1 is 10x10 = 100 and for P2 is 10x15 = 150, which is much 
lower than 1500. If we increase the problem size to 100 
courses, 100 faculty members and 15 timeslots, the number of 
variables becomes very large and the computation time increases 
exponentially. 
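The variable counts quoted above follow directly from the index sets of the two formulations. The short sketch below reproduces the arithmetic; the helper name `variable_counts` is hypothetical.

```python
def variable_counts(n_courses, n_faculty, n_timeslots):
    """Return (variables in a single joint model, variables in P1 plus P2)."""
    joint = n_courses * n_faculty * n_timeslots   # one variable per (course, faculty, slot)
    p1 = n_courses * n_faculty                    # x_ij in problem P1
    p2 = n_courses * n_timeslots                  # y_it in problem P2
    return joint, p1 + p2

print(variable_counts(10, 10, 15))    # small example from the text
print(variable_counts(100, 100, 15))  # larger case from the text
```

For the larger case the joint model has 150000 variables, against 10000 + 1500 = 11500 for the two-step approach.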

Next, we use a simple example to illustrate how our 
model works in the real world. 

For a small business programme in our university, we only have 
5 full-time faculty members and 5 adjuncts. We are going to 
offer 10 courses in the coming year and, based on past 
demand, we know that the number of classes we need to 
offer is 40. Refer to the input in Table 1 below for details. For 
example, C10 is the foundation course, so we need to offer 8 
sessions of C10 in a year, but C3 is an elective so only 1 
session is offered. Each faculty member is asked to fill in a table 
to state their preferences as well as up to 3 other different 
courses that they can teach based on their expertise. Adjunct 
faculty are also asked to fill in up to 3 courses that they can 
teach. These form the input (Table 1) for our mathematical model. 

Table 1: Input for example: 

Course demand: 

course   C1   C2   C3   C4   C5   C6   C7   C8   C9   C10
demand   3    2    1    2    6    3    4    5    6    8


Faculty preferences and courses they can teach: 

Resources   1st   2nd   3rd   Courses you can teach
R1          C5    C3    C2    C1, C4, C7
R2          C1    C4    C5    C2, C6, C9
R3          C3    C8    C6    C4
R4          C6    C8    C7    C10, C2
R5          C2    C1    C5    C6, C8


Resources   Courses that adjuncts can teach
R6          C1, C3, C5
R7          C2, C4, C8
R8          C1, C5, C8
R9          C3, C6, C9
R10         C5, C7

Number of classrooms available: 3 
Timeslots: t1, t2, t3, ..., t15 


Table 2: Output from P1 

Resources   Courses assigned
R1          C2(1), C5(1), C7(4)
R2          C1(2), C9(4)
R3          C4(2), C6(1), C8(3)
R4          C6(2), C8(2), C10(2)
R5          C2(1), C5(5)
R6          C3(1), C10(1)
R7          C10(2)
R8          C1(1), C10(1)
R9          C9(2)
R10         C10(2)


We can derive the courses that each faculty member needs to teach 
from the table above and share them with the faculty who are 
teaching the courses - refer to Table . We can verify that all the 
courses and sessions are assigned to faculty with reference to 
their preference lists. The faculty are also aware that the system 
will try its best to find the best match, but it may not be able to 
fulfill all their wishes, as the most important requirement is to 
satisfy the operational constraints and ensure that all courses 
are assigned at the end of this stage at minimum cost. 

After we have run the P1 model, we continue with the class 
timetabling problem P2. For timetabling, we need to assign a 
classroom and timeslot to each class, subject to room 
availability. The output of model P2 is a weekly timetable 
for all the courses and their respective classrooms. A faculty 
member teaches a course, which consists of multiple sessions of 
weekly classes over the whole semester. A class timetable denotes 
the day of the week and the timeslot in the designated classroom 
each week; thus we use class and course interchangeably for this 
purpose. 

In this example, there are only 3 classrooms available in each 
timeslot and there are altogether 15 timeslots available each 
week. The output from P2 is shown in Table 2. 

Using the output from P1 and P2, we need to schedule the 
faculty timetable. This is done using a simple rule: one faculty 
member can only teach one class at any point in time, and if there 
is a clash, 
we swap the time table with another class. 

       C1   C2   C3   C4   C5   C6   C7   C8   C9   C10
t1     0    0    0    0    0    0    0    0    1    1
t2     0    0    0    0    1    0    0    0    1    1
t3     0    0    0    0    0    0    0    1    1    1
t4     0    0    0    0    0    0    0    0    1    1
t5     0    0    0    0    1    0    1    0    0    1
t6     0    0    0    0    0    0    0    1    1    1
t7     0    0    0    0    0    0    0    0    1    1
t8     0    0    0    0    0    1    0    1    0    1
t9     0    1    0    0    1    0    0    1    0    0
t10    0    0    0    0    0    0    1    1    0    0
t11    0    0    0    1    1    1    0    0    0    0
t12    0    0    0    0    1    1    1    0    0    0
t13    1    0    0    1    0    0    1    0    0    0
t14    1    0    0    0    1    0    0    0    0    0
t15    1    1    1    0    0    0    0    0    0    0

Table 2: Output from P2 - (Timetabling) 


Table 3: Matching the faculty assignment with schedule: 

Resources   Courses assigned   Timeslots
R1          C2 (1)             t9
            C5 (1)             t2
            C7 (4)             t5, t10, t12, t13
R2          C1 (2)             t13, t14
            C9 (4)             t1, t2, t3, t4
R3          C4 (2)             t11, t13
            C6 (1)             t8
            C8 (3)             t3, t6, t9 (t8 clash so change to t9)
R4          C6 (2)             t11, t12
            C8 (2)             t8, t10
            C10 (2)            t1, t2
R5          C2 (1)             t15 (t9 can't be assigned)
            C5 (5)             t5, t9, t11, t12, t14
R6          C3 (1)             t15
            C10 (1)            t3
R7          C10 (2)            t4, t5
R8          C1 (1)             t15
            C10 (1)            t6
R9          C9 (2)             t6, t7
R10         C10 (2)            t7, t8
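
The P2 timetable reported in Table 2 can be checked against constraints (6) and (7) with a short script. The sketch below is our own illustration, with the matrix transcribed from the table (rows t1 to t15, columns C1 to C10); the function name `check_timetable` is hypothetical.

```python
DEMAND = [3, 2, 1, 2, 6, 3, 4, 5, 6, 8]   # sessions required for C1..C10 (Table 1)
ROOMS = 3                                  # classrooms available per timeslot

# Rows are timeslots t1..t15, columns are courses C1..C10 (Table 2, output from P2).
TIMETABLE = [
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # t1
    [0, 0, 0, 0, 1, 0, 0, 0, 1, 1],  # t2
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # t3
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # t4
    [0, 0, 0, 0, 1, 0, 1, 0, 0, 1],  # t5
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1],  # t6
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],  # t7
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 1],  # t8
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],  # t9
    [0, 0, 0, 0, 0, 0, 1, 1, 0, 0],  # t10
    [0, 0, 0, 1, 1, 1, 0, 0, 0, 0],  # t11
    [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],  # t12
    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],  # t13
    [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],  # t14
    [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # t15
]

def check_timetable(y, demand, rooms):
    """Return (demand_ok, rooms_ok) for a timetable y[t][i]."""
    # Constraint (6): each course gets at least as many slots as its demand.
    demand_ok = all(sum(row[i] for row in y) >= demand[i] for i in range(len(demand)))
    # Constraint (7): no timeslot uses more classrooms than are available.
    rooms_ok = all(sum(row) <= rooms for row in y)
    return demand_ok, rooms_ok

print(check_timetable(TIMETABLE, DEMAND, ROOMS))
```

Both checks pass for the reported timetable, and the 40 scheduled classes match the total demand.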


In this model, the associate dean of the university and the 
system (responsible for assigning the centrally controlled 
resources and interested in optimizing the welfare of faculty and 
students) act as the leader, in that the leader may hold its 
faculty assignment (its strategy) fixed while all other agents (the 
followers) react independently to the leader’s strategy, reaching 
a Nash equilibrium relative to the leader’s strategy. In 
game-theoretic terms, this is a Stackelberg game, and the resulting 
Stackelberg equilibrium of the faculty assignment model is induced 
by a strategy that is precisely the optimal assignment of all of 
the courses. 


V. Conclusion 

In conclusion, we propose an innovative method to solve the 
faculty assignment and timetabling problems for the university 
using a two-step approach. The problem is solved in a short 
running time of 30 minutes, and the approach spares the management 
team tedious manual planning. It also takes 
faculty preferences for courses into consideration; thus the 
outcome reduces conflict and thereby improves 
productivity and yields higher satisfaction among faculty 
members. By using this approach, we also reduce the number 
of variables and errors in runtime; thus it can be used by small 
and medium-sized universities for their resource planning. 

The limitations of our model are twofold. First, we assume 
that the demand for a course is known and does not vary too 
much once the course starts. Otherwise, it may not be 
economically viable to run the course if the number of students 
in the class is less than the breakeven number. We can resolve 
this during operations by cancelling the course if we find that 
the actual number of sign-ups is lower than the breakeven 
number. However, this can only be done before the start of the 
course. Once the course starts, even if some students drop out 
half-way and the number of students in the class falls below the 
breakeven number, the course continues and remains sub-optimal. 
This is to avoid any disruption to students’ study plans 
or delay to their graduation. Second, we assume that we always 
have enough adjuncts or faculty members to teach a course. If 
there is a mismatch, the model might not find a feasible 
solution. This can be overcome by ensuring that there are enough 
faculty members to teach the course. Otherwise, the course may not 
be offered until we find a faculty member to teach it. 

This approach can also be used in other industries where 
resource assignment or planning is required. In finance, we can 
use our model to allocate financial advisors to 
potential investors and to schedule appointments. In healthcare, 
we can deploy our model to schedule face-to-face doctor and 
patient appointments or operating theater sessions based on 
doctors’ preferred timeslots and operating theater 
availability. 

As a next step, we want to explore solving the 
problem with a one-step approach. We acknowledge that this 
problem might be too big or complex to be solved to optimality 
using the mathematical model within a 
reasonable timeframe. Thus, we will develop some heuristics to 
solve the real-world problem in the near future. 

Abstract - Nowadays, many women suffer 
from breast cancer. Mammographic image 
analysis is a vital tool for assessing and 
segmenting breast cancer. Many researchers in the 
literature have shown that the Retroareolar (RA) region 
of the breast is an important region for detecting 
cancer, whereby the performance of algorithms can be 
enhanced. However, conventional RA region 
detection algorithms have not proven reliable, and 
most of them rely on manual segmentation. 
In this work, we design and implement a 
novel automated RA region detection method for 
mammography images. Our proposed framework is 
divided into three steps to detect the RA region 
effectively. Simulation analysis shows that our 
proposed algorithms outperform the traditional 
approaches. 

Keywords: Mammography, RA region, breast segmentation, Hough 
transformation and morphological operations 

I. INTRODUCTION 

Breast cancer is the most common female cancer in the world, 
with approximately 1.67 million new 
cases diagnosed in 2015. While 
the age-adjusted incidence rate of breast cancer in 
India is lower than in other countries, because of 
the huge population the burden of breast cancer is high, 
about 1/3rd in urban and 1/9th in rural regions. The 
lack of population screening in India undoubtedly 
contributes to this statistic but, more importantly, so 
do lifestyle, reproductive and dietary factors. There 
need to be systematic efforts at researching, 
preserving and promoting those factors that “protect” 
Indian women from breast cancer. Typically, there 
are four main types of breast cancer: ductal 
carcinoma in situ (DCIS), where the cancer is 
confined within the ducts of the breast; lobular 
carcinoma in situ (LCIS), where the cancer is confined 
within the lobules or glands of the breast; invasive 
ductal carcinoma (IDC); and invasive lobular 
carcinoma (ILC). IDC and ILC refer to the types of 
breast cancer where the tumor has spread from the 
ducts or lobules it originated from, respectively, into 
the surrounding tissue of the breast. Other less 
common breast cancers include medullary carcinoma, 
mucinous carcinoma, Paget’s disease of the nipple, 
phyllodes tumor and tubular carcinoma. Breast 
cancer is grouped into stages which indicate the 
invasiveness of the disease. There are four stages (I, 
II, III, IV) defined by the American Joint Committee 
on Cancer based on a combination of tumor size, 
lymph node involvement, and presence or absence of 
distant metastasis. There is also a more general 
classification: early/local stage, where the tumor is 
confined to the breast; late/regional stage, where 
cancer has spread to the surrounding tissue or nearby 
lymph nodes; and advanced/distant stage, where 
cancer has spread to other organs besides the breast. 
There has been a decline in breast cancer mortality 
rates of about 2.3% over the last decade due to 
improved screening techniques leading to earlier 
detection, increased awareness, and improved 
treatments. 

II. RELATED WORK 

Diagnosis of breast cancer in its early stages is vital 
for increasing the probability of full recovery and for 
reducing the associated mortality rate. At present, 
early detection of breast cancer is done by 
mammography screening, which is the most widely 
used, effective and low-cost technique [1]. 
Detection and assessment of breast cancer can 
be done by computerized mammographic image 
analysis. In order to study parenchymal patterns [2], 
image-based biomarkers have been applied by 
various researchers on particular regions of interest 
(ROI). In this context, earlier works have 
shown that the association of image-based biomarkers 
is stronger in the zone immediately behind the 
nipple, namely the Retroareolar (RA) region [3]. 


Fig. 1 Example of wrong segmentation 


https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 





International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 15, No. 12, December 2017 


Later on, some research was done on 
parenchymal tissue analysis in the RA region [4]. 
In spite of their promising outcomes, the major 
limitation of these works is the requirement of 
human interaction for manual RA region 
segmentation. First, subjectively human-composed 
ROIs imply certain restrictions in terms of 
reproducibility and scalability (e.g. the application of 
these methods to large datasets). Second, most works in 
the literature used fixed, square ROIs, which 
might be a problem when adapting to the wide variety of 
breast shapes and sizes. To address the above-mentioned 
problems, we propose a novel and robust 
methodology for automatic RA region detection in 
mammographic images. We consider the geometry of the input 
breast image for automatic adjustment to 
diverse shapes and sizes. For this, we build upon 
recent work on the creation of anatomical coordinate 
systems in mammographic images [5], [6] 
and [7]. 

III. PROPOSED METHODOLOGY 

Segmentation of the breast is arguably the primary 
pre-processing step in algorithms for 
mammographic image analysis. Here, we perform 
the segmentation in two steps: scanning artifact 
removal and contour detection. Tape artifacts are 
markings left by tapes, or other shadows, that appear 
as horizontal or vertical strips in an image. 
Since these are straight lines, Algorithm 2 is 
used to segment the foreground and to detect the 
artifact lines. 

Algorithm 1: Breast Segmentation 
Case 1: RMLO view 
Case 2: LCC view 

Step 1: Select and read an input mammographic image 
Step 2: Determine whether the side is left (L) and whether the view 
is MLO using string comparison 
Step 3: If the side is left, flip the image horizontally 
Step 4: Detect the breast foreground using Algorithm 2 
Step 5: Detect the chest wall in the case of an MLO view using 
Algorithm 3 
Step 6: Finally, overlay the segmented mask on the input breast 
image 

For the second step, contour detection in the breast, a 
statistical chest wall segmentation technique is 
used in this work. Using Algorithm 3 after 
step 6, a post-processing step is performed to 
obtain the binary mask by means of 
morphological operators, in order to remove spurious 
artifacts and smooth the breast contour. 

Algorithm 2: Foreground segmentation 
Input: I and binary flag, 
where I = grayscale mammography image of size m x n and binary 
flag = 1 if the input is an MLO view 
Output: out_mask 

Step 1: Initialize the input parameters and find the central region 
Step 2: Calculate the maximum and minimum intensity values 
Step 3: Find the intensity threshold using the relative frequency 
and convolution 
Step 4: Remove artifacts with connected components labeling by 
keeping the largest left-most cluster 
Step 5: Find the contour points using region boundary tracing 
Step 6: Finally, obtain the out_mask by using curvature analysis 

Algorithm 3: Chest wall segmentation 
Input: grayscale mammography image 
Output: segmented chest wall 

Step 1: Crop the mammography image according to the contour 
Step 2: Apply dilation and filtering for pre-processing of the 
input mammography image 
Step 3: Replace the lower right corner with zeros 
Step 4: Detect the pectoral line using the Hough transformation, 
then calculate the edge coordinates 
Step 5: Calculate the accumulation array ‘A’ by quantizing the 
parameter spaces 
Step 6: Find the maximum in ‘A’ 
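
Steps 4 to 6 of Algorithm 3 describe a standard Hough line search. The sketch below is a generic, pure-Python illustration of that idea and not the authors' MATLAB implementation: the function name `hough_strongest_line`, the 1-degree angle quantization and the unit rho step are assumptions, and the edge points would in practice come from an edge detector applied to the pre-processed image.

```python
import math

def hough_strongest_line(edge_points, n_theta=180, rho_step=1.0):
    """Vote in a quantized (rho, theta) space and return the strongest line.

    A line is parameterized as x*cos(theta) + y*sin(theta) = rho.
    Returns (rho, theta_in_radians, votes).
    """
    max_rho = max(math.hypot(x, y) for x, y in edge_points)
    offset = math.ceil(max_rho / rho_step)      # shifts negative rho to valid indices
    # Accumulation array A: rows are quantized rho values, columns are quantized angles.
    acc = [[0] * n_theta for _ in range(2 * offset + 1)]
    for x, y in edge_points:
        for k in range(n_theta):
            theta = math.pi * k / n_theta
            rho = x * math.cos(theta) + y * math.sin(theta)
            acc[round(rho / rho_step) + offset][k] += 1
    # Find the maximum in A (ties resolved towards the larger rho, then larger angle).
    votes, r, k = max((acc[r][k], r, k)
                      for r in range(len(acc)) for k in range(n_theta))
    return (r - offset) * rho_step, math.pi * k / n_theta, votes

# Ten edge points on the vertical line x = 5.
points = [(5, y) for y in range(10)]
print(hough_strongest_line(points))
```

Because the parameter space is quantized, several neighbouring angle bins can collect the same number of votes, so the returned angle is only accurate to within a few bins.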

IV. SIMULATION RESULTS 

All the experiments were done in the MATLAB 
2016b environment with 4GB RAM and an Intel 
processor. The Digital Database for Screening 
Mammography (DDSM) [9] dataset was used 
for testing; it consists of nearly 2500 
mammogram cases from various medical institutions 
in the United States of America. In our experiments, 
we considered the images with LCC and RMLO 
views, scaling them by a factor of 8 and then 
converting them to uint8 to improve the processing 
speed. 


Fig. 2 Segmented RMLO and LCC view mammography images 
(panels: RMLO view, LCC view) 


Table I shows the number of tested 
mammography images with normal and abnormal 
conditions. The number of correctly segmented images is 
also given in the table, along with the accuracy in % 
values. In total, we tested 72 images and obtained an 
accuracy of 73.61% with our proposed methodology. 
Figure 2 shows the segmented output images of 
RMLO and LCC views with RA region 
detection, obtained using the 
algorithms discussed in Section III. 


Table I: Performance of breast segmentation 

Type     Analyzed   Correct   Accuracy
Normal   47         35        74.44 %
Cancer   25         18        72 %
Total    72         53        73.61 %


The performance of the proposed and conventional RA 
region detection in terms of accuracy is shown in 
Figure 3. The relation between specificity and 
sensitivity is shown in Figure 4. 



Fig. 3 Performance of proposed and traditional methods 
(y-axis: accuracy in %; bars: Traditional, Proposed) 

Experimental results show that the proposed method 
outperforms selected ROIs for the cancer detection 
task. 



Fig. 4 Relation between specificity and sensitivity 
(x-axis: specificity) 

V. CONCLUSIONS 


approaches for better segmentation accuracy. We can 
also develop RA region detection in 3D 
mammography images. 

Abstract 

We propose an approach to database security that exploits existing DBMS facilities to associate a 
separately maintained checksum value with critical data. Using our approach, a database’s content, 
domain and referential integrity remain the responsibility of the DBMS; however, when critical data is 
manipulated, checksum values are computed and stored in a separate database. Using this combination 
of databases, applications which access critical data can only access such data via checksum values, 
ensuring that data is created and accessed in a secure manner. 

Key words- DBMS, Security, Checksum, entity integrity, domain integrity, referential integrity and 
user-defined integrity. 


1. Introduction 

The primary responsibility for ensuring the
security and integrity of a database lies with
the Database Management System (DBMS)
[1]. At present, the most widely used
Relational DBMSs (RDBMSs) provide
several ways of ensuring data integrity at the
entity, domain, referential, and user-defined
levels. The last of these levels, i.e. the
user-defined level, involves constraints on the
forms of update that can be performed, and
such constraints are usually enforced via
triggers either by the database itself or by
applications that access the database.

Checksums have been the subject of research
and practical application for many years.
Gopalan [5] used checksums to enhance the
integrity of a conventional file system via a
checksum computed for each data block in a
file and indexed by a relative block number.
In such an approach, files are modified to
include a set of additional references that point
to the checksum blocks, and the addition of a
block to a file involves the computation and
storage of its checksum. Crocker [6] introduced
a checksum-based “Spam Detection” engine
for Lotus Domino mail systems called Block.
The system computes a checksum for each
message that is classified as “Spam” and
maintains the checksums in a database which
is replicated between the mail system and the
mail server. The database enables the server,
on receipt of each message, to compute a
checksum and compare it with the checksums
already in the database. If a match is found,
the message is likely to be further “Spam”.
A Network Appliance Inc. technical report [7]
describes a technique for reducing the volume
of data transferred during Backup and Restore
on UNIX platforms. The system uses
checksums to identify the portions of a file that
have changed between a previous backup and
the current one. Changed blocks are identified
via checksum values computed and maintained
for each block. Sabartnam [8] uses checksums
to detect errors in database manipulations
including transactions, locks, logs, and data
buffers. In this case, a checksum is added to
the object being manipulated, and a checksum
field is attached to the access method and also
to each of the objects in associated hash
buckets. A hash vector and object checksum
are recomputed during the restart of a DBMS
and compared with stored checksum values.

We describe an approach to database security
that similarly exploits checksums. First, in
Section 2, the forms of data integrity that must
be maintained in a conventional RDBMS are
examined. In Section 3 we present our
approach to maintaining user-defined integrity
using error detection and correction algorithms
similar to those implemented in network
protocols, i.e. the Hamming Code Protocol [3],
[4]. An implementation of the proposed
approach is explained in detail in Section 4,
Section 5 reports tests and results, and
Section 6 presents our conclusions.

2. Data Integrity in an RDBMS 

Data integrity levels in an RDBMS can be
classified into four main kinds, i.e. entity
integrity, domain integrity, referential integrity
and user-defined integrity [2].

- Entity Integrity: Useful at the row level in a table,
entity integrity ensures that a relation does not have any
duplicate rows and that each row in a relation has its own
unique identifier which distinguishes it from other rows.
This unique identifier is termed the Primary Key (PK), and
each relation must have a unique primary key that can be
defined by one or more of its attributes.

- Domain Integrity: Values in a column in a table must
be drawn from some well-defined “domain” of values.
This is the simplest form of integrity constraint,
which is maintained at all times and in all
circumstances. In effect, a “domain” corresponds to a
type in a conventional programming language, and
values in a column must be drawn from one of the
available types.

- Referential Integrity: This is applied at the table level
such that values available in one relation are available
and synchronized with those in other relations.
Referential integrity is enforced with a Primary Key (PK) and
Foreign Key (FK) combination. A foreign key
comprises one or more columns in a “child relation”
whose values are synchronized with those in the PK in
a “parent relation”. The FK accepts only those values
that exist in the PK in order to maintain the integrity of
the relation. Referential integrity is preserved when
applying any Data Manipulation Language (DML)
operations, i.e. insert, update and delete operations, via
the following constraints on the application of such
operations:

1. Restricted: Disallow data modification.

2. Cascaded: Extend the data modification on the
parent relation to all child relations.

3. Nullified: Set the values of matching FKs in
child relations to the value NULL.

- User-Defined Integrity: In effect, user-defined
constraints on the ways in which a database can be
manipulated. Such constraints are the responsibility of the
system administrator, who will administer access rights
and enforce rules and regulations.

The technique described in the following section is
intended to deal with user-defined integrity, a form of
integrity typically enforced via triggers and constraints
[9]. Our technique is intended to augment conventional
approaches to user-defined integrity by exploiting
conventional triggers and constraints to ensure that access
to critical data is only possible via checksums.

3. Database Checksum Requirements and
Prerequisites

There are several requirements that our approach must
satisfy:

- As the potential exists for “illegal” data modifications
to change all the stored data, any solution should be
able to both detect and correct all the modifications to
data in such circumstances (completeness)

- The solution should not be restricted in the domains 
whose values may be used in the computation of the 
checksums (no special cases) 

- The solution should involve no additional overhead 
where legal modifications are made (efficiency) 

- The solution has the prerequisite that both the original 
database and its associated checksum database are 
available to all applications which access the original 
database, and that all such applications correctly 
maintain both the original database and the associated 
database of checksums (correctness) 

- Execution time overheads and additional space 
requirements should both be minimal (efficiency) 

3.1. Our Approach 

Our approach involves the use of two databases: the first
contains conventional data of some kind, and the
second stores the checksum values, i.e. a result and
a remainder, that are associated with critical data in the
first database. Figure (1) shows two example databases
named “DB1” and “CS1”. “DB1” is a conventional database
containing values of any permissible type, e.g. integer,
decimal, date including time, etc. The checksum values for
critical data in “DB1” are stored in “CS1”.



Figure (1) 

Checksum values are calculated as the integer quotient and the
modulus (remainder) of a data item (the item being first converted
to its ASCII value if necessary) with respect to a divisor chosen by
the system administrator.

The original value is recalculated from the stored checksum
values using

D_i = G(x) \cdot P_i + R_i    (1)

where Di is the recalculated value, G(x) is a fixed regular
divisor, e.g. 1028, Pi is the result of the division of
Mi by G(x), and Ri is the modulus of Mi and G(x). The result
is given by

P_i = \lfloor M_i / G(x) \rfloor    (2)

The modulus value is calculated using equation (3), as
shown below:

R_i = M_i - G(x) \cdot \lfloor M_i / G(x) \rfloor    (3)

Here Mi is the i-th datum in a database of N data items,
\lfloor \cdot \rfloor denotes integer division (the quotient with the
remainder discarded), and Pi and Ri are the checksum values for
the i-th datum, where i \in \{1, 2, 3, \ldots, N\}.

By substituting equations (2) and (3) into equation (1) we
obtain

D_i = G(x) \cdot \lfloor M_i / G(x) \rfloor + M_i - G(x) \cdot \lfloor M_i / G(x) \rfloor    (4)

The two quotient terms cancel, so equation (4) simplifies to

D_i = M_i    (5)

that is, the value recalculated from the stored checksum pair
equals the original datum.

The checksum values Pi and Ri are maintained separately,
e.g. in the database called “CS1” in Figure (1), and used later
to validate the integrity of the i-th data item stored in the
original “DB1” database. When a data item in the “DB1”
database is to be accessed by an application, the
application must a) compute checksum values, and b)
compare the computed values with those stored in the
“CS1” database. If the computed and stored checksum
values differ, then the associated critical data in the “DB1”
database is assumed to have been changed illegally, and
the stored checksum values for the i-th data item are used
to recalculate the original value and restore the database
to a consistent state.
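
The compute, compare and restore steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under our own assumptions: data items are non-negative integers, the divisor is the example value 1028, and the function names are hypothetical rather than taken from the implementation.

```python
G = 1028  # divisor chosen by the administrator (example value from the text)


def make_checksum(m, g=G):
    """Return (P, R): integer quotient and remainder of m by g."""
    p = m // g
    r = m - g * p
    return p, r


def recover(p, r, g=G):
    """Recalculate the original value from the stored pair (equation 1)."""
    return g * p + r


def validate_and_restore(stored_value, p, r, g=G):
    """Return (is_valid, value_to_keep) for a datum read from DB1."""
    original = recover(p, r, g)
    return stored_value == original, original


# Legal insertion of the value 123456, followed by two reads:
p, r = make_checksum(123456)
ok, value = validate_and_restore(123456, p, r)            # untouched datum
tampered_ok, restored = validate_and_restore(999999, p, r)  # illegally changed datum
```

In this sketch the pair (p, r) plays the role of the row kept in “CS1”, and the first argument of validate_and_restore plays the role of the value currently held in “DB1”.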

Consider, next, the worked example below.

i. Integer Test

Let Mi denote a “critical” credit card number that is to be
inserted in a “DB1” database of account information, and
G(x) denote a fixed regular divisor, 1028. For simplicity,
we assume that the initial balance in all accounts is £0.00.
Table (1) below shows the values stored in the
database together with their related checksum values.

Mi                 G(x)   Pi              Ri    Di
7654321678954320   1028   7445838209099   548   7654321678954320

Table (1) stored data


Keep in mind that the checksum values Pi and Ri
are stored in the “CS1” database and the value Mi is
stored in the “DB1” database together with other account
data, including the opening balance.

Assume that, subsequent to the initial insertion of the
account numbered 7654321678954320 into the “DB1”
database, an illegal update takes place and the credit card
number is changed to 7654321678955555. Table (2)
below shows the content of the database after the illegal
modification has taken place.


Mi                 G(x)   Pi              Ri    Di
7654321678955555   1028   7445838209099   548   7654321678954320

Table (2) Table with illegally changed Mi value


Any query issued on this account (or any other critical
data) automatically triggers recomputation of the value Di
from the stored checksum values using equation (1), followed
by a comparison of Di with the value Mi currently stored in
“DB1”, as shown in Figure (2). If the two values differ, the
original value is restored in the database to
preserve database integrity.



Figure (2) 


If Mi = Di then the critical data is assumed to be valid, 
otherwise the critical data is assumed to be incorrect, but 
can be recovered provided only that the values of the data 
involved were inserted by an authorized user. 
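
The integer test can be reproduced with a few lines of Python. The sketch below uses the card numbers and divisor from the worked example; the variable names are ours and the code is illustrative only.

```python
G = 1028

# Legal insertion: the checksum pair is computed once and kept in CS1
original_card = 7654321678954320
p, r = divmod(original_card, G)

# Value found in DB1 after the illegal update
tampered_card = 7654321678955555

# On the next query, recompute D from the stored pair and compare with M
d = G * p + r
is_valid = (tampered_card == d)
restored = tampered_card if is_valid else d

print(p, r, is_valid, restored)
```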

ii. Character Test

Assume that we have a string Mi whose value is “Rob”
that is to be inserted into a “DB1” database whose fixed
divisor, G(x), is 1028. The string is first converted to the
sequence of ASCII codes of its characters. The value 100
is added to each ASCII code to ensure that every code is
3 digits in length, and the resulting values are
concatenated to obtain a single number.

For Mi = “Rob”, the ASCII code for R is 82, for o is 111
and for b is 98, giving 182 211 198 when the necessary
100 is added, i.e. the number 182211198.


If G(x) is 1028, the values for Pi and Ri are as shown in
Table (3) below.

Character set   Mi            G(x)   Pi       Ri    Di
Rob             182 211 198   1028   177248   254   182 211 198

Table (3) stored data


Assume that, subsequent to the initial insertion, an
illegal update occurred and the name was changed to
“John”. Any query issued on this critical data requires the
value Di to be recomputed from the stored checksum
values, i.e. the values associated with the initial
insertion of “Rob”, using equation (1), and compared with
the value currently stored.

Each 3-digit substring of Di represents a character:
subtracting 100 from each three-digit substring
yields the ASCII codes 82, 111 and 98, enabling the value
in the “DB1” database to be recovered.
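
The character encoding and recovery steps can be sketched as follows. This is an illustrative Python sketch, assuming non-empty strings of 7-bit ASCII characters (so that every code plus 100 has exactly three digits); the helper names encode and decode are hypothetical.

```python
G = 1028


def encode(text):
    """Concatenate (ASCII code + 100) for each character into one integer."""
    return int("".join(str(ord(ch) + 100) for ch in text))


def decode(number):
    """Split into 3-digit groups and subtract 100 to get the characters back."""
    digits = str(number)
    return "".join(
        chr(int(digits[i:i + 3]) - 100) for i in range(0, len(digits), 3)
    )


m = encode("Rob")               # numeric form of the string
p, r = divmod(m, G)             # checksum pair kept in CS1
recovered = decode(G * p + r)   # rebuilt from the pair alone

print(m, p, r, recovered)
```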

4. An implementation of the proposed solution 

For simplicity, it is to be assumed that all access to data in 
a “DB1” database is protected through the use of an 
application view [12], and direct access is forbidden by 
the DBMS’s access policy which is set by the database 
administrator and subject to any other third-party security 
system, e.g. a Firewall [10, 11]. 

Application triggers then perform the calculation of the
checksum values, i.e. the result Pi and the remainder Ri,
and their storage in the “CS1” database.

Consider, next, a customer payment system composed of
several tables, among which are the three tables shown in
Figure (3).


Figure (3) Case Presentation


Our interest here is to preserve the integrity and security
of the following fields:

- CardNo field in the CardDetails table.

- ClientName field in the Client table.

- PaymentAmount field in the Payment table.

As suggested earlier, direct integrity rules provide basic
integrity protection, but are unable to prevent illegal
changes to values stored in a database, and are similarly unable
to prevent the illegal exposure of such values. In such
circumstances database transactions, both “legal” and
“illegal”, will run normally and it will be assumed (even in
the event of an illegal transaction) that the integrity of the
database is preserved and uncompromised.


4.1 Validation Process

The validation process is concerned with data integrity. It
recalculates the data value from the pre-saved checksum
indicators (result and remainder) and compares the
recalculated value with the value currently stored. The
validation algorithm is shown below:

Begin
  Identify record in Client table;
  Obtain CardNo value for defined record;
  Obtain result, remainder values for defined record;
  Compute checksum:
    checksum = result × G(x) + remainder;
  Convert checksum to string;
  if checksum = 0 and (CardNo = 0 or CardNo = null)
    { validate = true; }
  else if checksum ≠ 0 and CardNo = null
    { validate = false; }
  else
    { if checksum = CardNo
        { validate = true; }
      else
        { validate = false; }
    }
End

4.2. Number implementation

This example is applied to the CardDetails and
NCalculations tables shown in Figure (4):


Figure (4) Presented Case: the CardDetails table
(TransactionNo PK, CardNo, ClientNo FK) is related to the
NCalculations table (RecodNO, TransNo, CalResult,
CalRemainder).


CardDetails Table: Any illegal changes to values of the
CardNo in this table must be detected and restored.

NCalculations Table: Used to maintain the checksum
values for corresponding CardNo values.
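
The validation algorithm of Section 4.1 can be sketched in Python as below. This is a hedged illustration rather than the implementation itself: we assume the CardNo is held as a string of digits (or None for a null value), that the divisor is 1028, and the function name validate is our own.

```python
G = 1028  # assumed divisor, as in the worked examples


def validate(card_no, result, remainder, g=G):
    """Mirror the Section 4.1 steps; card_no is a digit string or None."""
    checksum = result * g + remainder
    # No checksum recorded and no card number stored: nothing to protect
    if checksum == 0 and card_no in (None, "0"):
        return True
    # A checksum exists but the card number has been nulled out
    if checksum != 0 and card_no is None:
        return False
    # Otherwise the recalculated value must match the stored one
    return str(checksum) == card_no


print(validate("7654321678954320", 7445838209099, 548))
print(validate("7654321678955555", 7445838209099, 548))
```

Here result and remainder correspond to the CalResult and CalRemainder columns of the NCalculations table.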

Our approach allows users to perform basic data 
manipulation operations such as Insert, Update, Delete 
and Select. Two types of triggers are then used, i.e. 
conventional triggers [9] and block-type triggers [9]. 

Conventional Triggers 

- Pre-Query Trigger: Used to delete any data inserted
illegally in the “DB1” database by an unauthorized user,
and to check user authorization; see Figure (5).



Figure (5) 


- Pre-Insert, Pre-Update & Pre-Delete Triggers:
These triggers check user authorization, and if the
user is unauthorized an error message is generated.
See Figure (6).



Figure (6) 


Block Triggers 

- Pre-Insert Trigger: This trigger calculates the
checksum values, i.e. Pi and Ri, using equations (2) and
(3), and stores the values in the NCalculations table in
the “CS1” database. See Figure (7).



Figure (7) 

- Post-Query Trigger: This trigger is used after any
query to regenerate the original values from the checksum
values stored in the “CS1” database, using equation (1).
Recomputed values are compared
with the values stored in the transactional table in
“DB1”. If they are not the same, the recomputed value
is stored in “DB1”. See Figure (8).



Figure (8) 


- Pre-Update Trigger: This trigger provides the means to
update the values recorded in “DB1” and “CS1”.
Updates are forced on the NCalculations table, which
contains the values Pi and Ri. The update uses the
recalculated checksum values, i.e. Pi and Ri. See
Figure (9).



Figure (9) 

- Pre-Delete Trigger: This trigger deletes the values
from the transactional table in the “DB1” database
together with the related checksum values, i.e. Pi and Ri,
in the “CS1” database.
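
The interplay of the insert-time and query-time triggers can be illustrated with a small simulation. The sketch below is not the Oracle trigger code of the implementation: it uses Python's standard sqlite3 module with two in-memory databases standing in for “DB1” and “CS1”, reduces the tables to the columns needed, and the functions pre_insert and post_query are hypothetical stand-ins that fold the query-time checks (restoring changed values and deleting rows without a checksum entry) into one routine.

```python
import sqlite3

G = 1028  # assumed divisor

db1 = sqlite3.connect(":memory:")  # stands in for DB1 (critical data)
cs1 = sqlite3.connect(":memory:")  # stands in for CS1 (checksum pairs)
db1.execute(
    "CREATE TABLE CardDetails (TransactionNo INTEGER PRIMARY KEY, CardNo INTEGER)"
)
cs1.execute(
    "CREATE TABLE NCalculations "
    "(TransNo INTEGER PRIMARY KEY, CalResult INTEGER, CalRemainder INTEGER)"
)


def pre_insert(trans_no, card_no):
    """Legal insert: store the value in DB1 and its (P, R) pair in CS1."""
    p, r = divmod(card_no, G)
    db1.execute("INSERT INTO CardDetails VALUES (?, ?)", (trans_no, card_no))
    cs1.execute("INSERT INTO NCalculations VALUES (?, ?, ?)", (trans_no, p, r))


def post_query():
    """Restore changed values and delete rows that have no checksum entry."""
    expected = {
        t: G * p + r
        for t, p, r in cs1.execute(
            "SELECT TransNo, CalResult, CalRemainder FROM NCalculations"
        )
    }
    rows = db1.execute("SELECT TransactionNo, CardNo FROM CardDetails").fetchall()
    for trans_no, card_no in rows:
        if trans_no not in expected:
            db1.execute(
                "DELETE FROM CardDetails WHERE TransactionNo = ?", (trans_no,)
            )
        elif card_no != expected[trans_no]:
            db1.execute(
                "UPDATE CardDetails SET CardNo = ? WHERE TransactionNo = ?",
                (expected[trans_no], trans_no),
            )
    return db1.execute(
        "SELECT TransactionNo, CardNo FROM CardDetails ORDER BY TransactionNo"
    ).fetchall()


pre_insert(1, 7654321678954320)
# Direct (illegal) access that bypasses the application:
db1.execute("UPDATE CardDetails SET CardNo = 7654321678955555")
db1.execute("INSERT INTO CardDetails VALUES (2, 1111222233334444)")
rows_after_query = post_query()
print(rows_after_query)
```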

4.3 Character implementation 

This implementation example involves the tables in
Figure (10). The Client table contains the client
transaction information. Illegal changes to the
ClientName in this table must be detected and restored.
The Calculations table stores the checksum values Pi
and Ri, i.e. the result and remainder.

The implementation must provide users with a means of
performing the basic DML operations Insert, Update,
Delete and Select. Two types of triggers are used, i.e.
conventional triggers [9] and block triggers [9], of the
same kind used in the implementation in Figures 5 and 6.


Figure (10): the Client table (ClientNo PK, ClientName,
Address, Telephone, Email, Fax) is related to the
Calculations table (RecodNO, TransNo, CalResult,
CalRemainder).


Block Triggers 

- Pre-Insert Trigger: This trigger converts the text to
ASCII codes and calculates the checksum values Pi
and Ri, which are then stored in the Calculations table
in the “CS1” database. See Figure (11) below.


- Pre-Update Trigger: This trigger provides the means to
update the values in the “DB1” database as well as the
checksum values in the “CS1” database. Updates are forced
on the Calculations table containing the checksum values Pi
and Ri. This update uses the recalculated checksum values
for Pi and Ri. See Figure (12).



Figure (11) 



Figure (12) 

- Post-Query Trigger: This trigger is used after all queries.
It regenerates the original numeric value from the stored
checksum values using equation (1), splits it into ASCII
codes, and then converts the ASCII codes to a string. The
value regenerated from “CS1” is then compared with the data
stored in the transactional table in “DB1” and, if they are
unequal, the correct value is automatically inserted into
“DB1”. See Figure (13).

Using our approach, conventional user authorization is
maintained via a username and password combination
which must be provided when connecting to a database.
In addition, Roles Assignment [9] ensures that authorized
users are granted the required role and that their
transactions are thus valid. In our examples, users were
assigned the role “CHECKSUM” in order that they are
able to make legal changes.

5. Test and Results

As a simple test to determine the correctness of the
proposed approach, 11 random records were entered
into a “DB1” database. The test data was then updated by
illegal users using direct access to the “DB1” database
through SQLPLUS, where all ClientName and CardNo
attributes were updated and committed as shown below.



Figure (13) 


5.1 Character Test 

The following SQLPLUS command was used to check 
the client name from the table called client. 

SQL> select clientname from client; 

The returned result is displayed below:


CLIENTNAME 


Allen Marlard 
Ward Jones 
Martin King 
Adams Blake 
Martin Jones 
Scott Clark 
King 

Turner King 
Adams 
James Miller 
Miller King 

11 rows selected. 

Suppose the client names in the table client are to be
changed illegally by connecting directly to the
“DB1” database through SQL. The illegal update is executed and
committed with the following commands:

SQL> update client set clientname = 'Oliver';

11 rows updated.

SQL> commit;

Commit complete.

If we recheck the contents of the table client in the
“DB1” database, all client names have been changed to
Oliver, as shown below.

SQL> select clientname from client;

CLIENTNAME

Oliver
Oliver
Oliver
Oliver
Oliver
Oliver
Oliver
Oliver
Oliver
Oliver
Oliver

11 rows selected.

It should be mentioned here that, at each query by the
proposed application, the critical data in the “DB1” database
is checked by recomputing the values from the stored
checksums and comparing them with the data stored in the
“DB1” database. If the values are equal the content is
considered to be valid; if they differ, the original values are
restored using the stored checksum values. If a
legitimate user queries the “DB1” database using our
application, and the content of the table client is then
checked, it can be seen that the original
data has been restored.

We select the ClientName again from SQLPLUS as 
below. 

SQL> select clientname from client; 

CLIENTNAME 


Allen Marlard 
Ward Jones 
Martin King 
Adams Blake 
Martin Jones 
Scott Clark 
King 

Turner King 
Adams 
James Miller 
Miller King 

11 rows selected. 

The next test checks that, if an illegal user inserts a record into
the “DB1” database, the illegal insertion is
removed from the “DB1” database, as follows.

First the record is inserted and committed:

SQL> insert into client values (100,'CATHIE','88 Park Road',
     '0161383621','Cathie@hotmail.com','0161383620',1);

1 row created.

SQL> commit;

Commit complete.

A recheck of the table client then indicates that Cathie
has been added, as shown below.

SQL> select clientname from client; 


CLIENTNAME 


Allen Marlard 
Ward Jones 
Martin King 
Adams Blake 
Martin Jones 
Scott Clark 
King 

Turner King 
Adams 
James Miller
Miller King
Cathie

12 rows selected. 

Queries on the “DB1” database using our application forms
involve a check on the content of the table client, and it can
be seen that the original data has been restored and the
illegal record has been removed, i.e. selecting the clients as
shown below:

SQL> select clientname from client;

CLIENTNAME 


Allen Marlard 
Ward Jones 
Martin King 
Adams Blake 
Martin Jones 
Scott Clark 
King 

Turner King 
Adams 
James Miller 
Miller King 

11 rows selected. 


5.2 Number Test 

The same type of test was performed on the CardDetails
table, with the same results as for the character test. In the
first test, the CardNo was updated by an illegal user and was
successfully restored; in the second test a new record was
illegally inserted, then detected and removed.

The two tests were carried out on an implementation
of our model. In both tests the system was subjected to illegal
access via direct access to the “DB1” database, where
several records were updated and/or new records inserted
in the Client and CardDetails tables.

The test results indicate that the model works as intended.
The performance of the system was monitored during the
tests, and the measured overhead of computing
and using checksum values is reported in Table (4).

Table (4) shows the actual processing times for the update
comparison operation on the test database; the test was
carried out using different database sizes. As can be seen,
the data integrity recalculation process affects the
performance of the system: processing time with checksums
is on average about 2.17 times that of the regular database.
However, this effect is within acceptable time margins and
would not affect users during a normal run of the system, as
the additional delay per record is small.

No. of     Scenario 1: regular DB   Scenario 2: DB with checksum   Checksum
records    (milliseconds)           (milliseconds)                 overhead
10,000     31,236                   68,406                         2.19
20,000     62,187                   137,813                        2.22
30,000     94,000                   203,765                        2.17
40,000     127,406                  271,765                        2.13
50,000     158,094                  340,219                        2.15
Average                                                            2.17


Table (4) Test Results
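
The overhead column of Table (4) is the ratio of the two timing columns. The following Python sketch recomputes it from the tabulated timings and also derives the extra time per record; the variable names are ours and the per-record figure is a derived quantity, not one reported in the tests.

```python
# (records, regular DB ms, DB-with-checksum ms) as listed in Table (4)
timings = [
    (10_000, 31_236, 68_406),
    (20_000, 62_187, 137_813),
    (30_000, 94_000, 203_765),
    (40_000, 127_406, 271_765),
    (50_000, 158_094, 340_219),
]

# Overhead factor: time with checksums divided by time without
ratios = [with_cs / regular for _, regular, with_cs in timings]
average_ratio = sum(ratios) / len(ratios)

# Additional milliseconds spent per record because of the checksums
extra_ms_per_record = [(with_cs - regular) / n for n, regular, with_cs in timings]

for (n, _, _), ratio, extra in zip(timings, ratios, extra_ms_per_record):
    print(f"{n:>6} records: overhead x{ratio:.2f}, +{extra:.2f} ms per record")
print(f"average overhead x{average_ratio:.2f}")
```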


Figure (14) plots the test data for both scenarios: the
first without checksum parameters and the second with
the checksum parameters.



Fig (14) Oracle 10G Checksum Update overhead 


6. Conclusions 

We have presented an approach that can be used for all
types of data. The approach is based on a checksum
validation algorithm applied only to critical data such as
credit card numbers. It protects the data from
illegal changes and helps to guarantee data integrity. The
checksum values can be stored in the same
database as the regular data, or they can be stored in a
separate database, i.e. the “CS1” database, which increases
the level of security. It was also shown that the algorithm
provides the data integrity needed by a
system regardless of its size or data type. The
proposed checksum mechanism has the advantage
that it can detect whether data has been modified and, in
that case, recompute the original data, thereby
providing a data integrity mechanism to protect the data.


Future work based on this algorithm could address
maintaining the integrity of encrypted data. In
this case the checksum values would be calculated for the
encrypted data and then stored, to be
used at any time to reconfirm the validity of the encrypted data.
This would protect encrypted data from illegal changes
made by unauthorized users. The
algorithm is therefore capable of protecting any data type from
illegal changes, especially changes that
replace data in the database with data of a similar type but a
different value.

Abstract

The population of Jenin city has recently been increasing rapidly, and this
amplifies the need for more infrastructural objects such as hospitals.
Hospitals are considered among the most important infrastructural
constructions in cities as they provide health care services. However,
current hospitals and medical resources are limited and randomly
allocated in Jenin city. Accordingly, in this paper we propose a
suitability model that employs Geographic Information System
(GIS) based Multi-Criteria Analysis (MCA) with the Analytical
Hierarchy Process (AHP) to identify suitable locations for building a
new hospital in Jenin city. An experimental instantiation of the
proposed model is presented, and the produced results show that the
majority of suitable areas are located in the northeast of Jenin. This is
mainly because the northeast of Jenin is away from industrial areas and
dumping sites.

Keywords: Hospital site selection, GIS, MCA, AHP. 


I. Introduction 


Recently, the number of people moving from rural to urban
areas has been increasing rapidly. This rapid increase of urban
population creates various social, economic and
environmental changes such as unplanned sprawl, inadequate
housing facilities, traffic congestion, insufficient drainage,
and lack of health facilities [1]. It is very important to provide
all facilities and infrastructural constructions in urbanized
areas to cope with the rapid urban growth. Therefore, it
becomes the government’s responsibility to provide the
required resources and facilities for urban areas in suitable
locations.

Hospitals are among the most important facilities and have a
vital role in providing health care services. Identifying the best
locations for new hospitals is an important issue because
selecting suitable locations helps the government to
optimize the allocation of medical resources, reduce social
contradictions and control health care development in rural
and urban areas. In addition, appropriate hospital site
selection helps people reach hospitals easily, reduces the
time of rescue and improves the quality of life [2].

Current hospitals in Jenin city are randomly distributed and
arbitrarily allocated due to unreasonable distribution by
the government. For example, inside the city center of Jenin,
hospitals are saturated. With the growth and extension of
Jenin, the population increases rapidly and spreads into areas
outside the city center, and the mismatch between supply
and demand for hospitals is becoming severe. Moreover, there
is a persistent need to build quality hospitals that provide
professional health care services, because high-quality
medical buildings are limited.

Several studies have employed GIS techniques and products to
address the problem of identifying the best locations for
building hospitals and planning health services [3, 4]. Most of
these studies take into account several parameters to allocate
suitable sites for building hospitals, such as existing hospitals,
population, economic factors, pollution, and other laws and
regulations.

In this paper, we aim to identify the most suitable areas for 
building a new hospital in Jenin city. In order to achieve this 
goal, we will exploit GIS products and methods with MCA in 
addition to AHP. By this, we mean that the study will take into 
account many factors such as existing hospitals, proximity to 
main roads, and distance to polluted and industrial areas. After 
that, we will assign them different weights (according to their 
importance) based on AHP. 

The main contributions of our work are summarized as 
follows: 

• Employing GIS-based MCA in order to identify the 
best locations for building a new hospital in Jenin 
city. 
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• Exploiting AHP to assign weights and scores for the III. Study Area 


identified criteria (i.e. factors) in order to select the 
best location for the new hospital. 

The rest of this paper is organized as follows. Section 2 
presents the related work. A general overview of the study 
area is presented in section 3. Section 4 presents the general 
architecture of the proposed model and the implementation 
details of the suitability model. Experimental validation and 
evaluation of the proposed model is presented in Section 5. In 
Section 6, we draw the conclusions and outline future work. 

II. Related work 


We will clarify our contributions in the following paragraphs 
by offsetting them with prior related work. Several studies 
have employed GIS techniques and methods in health services 
and for planning public health [3, 4 and 5]. For example, the 
authors in [6] combined GIS with Location Based Services 
(LBS] in order to settle the affairs of emergency medical 
incidents. On the other hand, other authors employed GIS 
techniques and methods in selecting the best site for building 
health care facilities. In order to build constructions that 
provide health care facilities, various parameters (i.e. factors) 
can be considered to identify the most suitable sites such as 
existing health care facilities, population, economic factors 
and pollution. These parameters can be classified, analyzed 
and integrated together in different methods. For example, 
MCA is used to identify factors that affect building new health 
care objects in [7]. While in [8], the researchers employed 
both GIS and Analytical Hierarchy Process (AHP) to 
determine the parameters that affect the physical accessibility 
of neurosurgical emergency hospitals in Sapporo city. Similarly, 
the authors of [9] exploited AHP to evaluate the 
appropriateness of the location selected for a Taiwanese 
hospital. 

Although AHP allows multi-criteria decision-making, it 
relies on hidden assumptions such as the consistency of the 
pairwise judgments. Besides, it is difficult to use when there 
is a large number of criteria. To overcome these problems, the 
Fuzzy Analytical Hierarchy Process (FAHP) was later used for 
hospital site selection [10]. 

In our proposed work, we aim to employ MCA based on GIS 
methods and techniques to identify the best site for building a 
new hospital in Jenin city. Besides, we will exploit AHP to 
assign weights for the factors that affect the new hospital site 
selection. The produced results indicate 
that GIS-based methods and tools can play a vital role in making 
effective decisions in the health field. 

III. Study Area 


The study area is Jenin governorate. It is located in the north 
of West Bank as shown in Figure 1. In 2016, the city had a 
population of 318,958 according to the census by the 
Palestinian Central Bureau of Statistics [11]. It is located 
about 43 km north of Nablus, at an elevation of about 100-250 m 
above sea level. The name of Jenin was derived from Ein 
Ganim meaning “the spring of Ganim” and referring to the 
region’s plentiful spring. 

Jenin is under the administration of the Palestinian 
Authority. Today, Jenin is built on the slopes of a hill and 
surrounded with different types of trees such as carob, fig, 
and palm trees. It is distinguished by its agriculture, 
producing various types of crops. 



Fig. 1: Jenin governorate in West Bank 


Jenin governorate has 82 localities and one camp, and we 
divided the study area into three main regions: Jenin city, 
Jenin camp and villages that belong to Jenin governorate. 

During our work, we focus on Jenin city that has three main 
hospitals. The details of these hospitals are given in Table 1, 
and their locations are illustrated in Figure 2. 


Table 1: Existing hospitals in Jenin city. 

Name                         Specialization   Administered by   No. of beds 
Al-Razi Hospital             General          Private sector    60 
Al-Amal Hospital             General          Private sector    20 
Jenin Governmental Hospital  General          Government        120 


Fig. 2: The main hospitals in Jenin governorate. 


IV. Data and Methodology 

In this section, we present the methodology used in our 
proposed model in order to identify the optimal site for 
building a new hospital in Jenin city. Figure 3 depicts the steps 
of our proposed model. 



Fig. 3: A flow chart depicting the methodology used to identify 
suitable locations for building a new hospital. 


As shown in figure 3, the proposed model starts with 
collecting data from various resources to represent the aspects 
of the study area. Some of those data are collected from online 
resources such as GeoMOLG [12] and others are manually 
digitized with a suitable scale. After that, the collected data 
are managed and prepared for use. By this, we mean that the 
data are organized and stored efficiently for further analysis. 
Then, analysis objectives and criteria are identified to use 
them in further steps. Once analysis criteria are identified, 
they are assigned weights based on AHP. Those weights are 
used to indicate the importance of each criterion. After that, 
during the data analysis step, various GIS tools and methods 
are employed in order to produce the set of suitable locations 
for building a new hospital. 

A. Data Collection 

In order to find the best location for building a new hospital, 
we need to collect data in the format of vector shape files. 
These shapefiles are collected from GeoMOLG [12], and some 
of them were created by digitizing maps obtained from Jenin 
municipality. Once the vector data are collected, they are 
converted into raster format later, during the analysis step. 


B. Data Management and Preparation 

During this step, the collected data are prepared to be used in 
the analysis process. Data are often collected with missing 
values and errors, so we need to correct these errors and 
organize the data in datasets and geodatabases. Correcting 
the data and integrating them into feature datasets 
plays a vital role in this step. 

Additionally, it is important in this step to answer some 
questions about the collected data such as: 

• What is the data format? 

• At what scale were the data collected? 

• Are the data projected? 

• Do the data have all the needed attributes? 

• Do the data have constraints, and does the feature 
geometry support the analysis process? 


C. Identify Objectives and Criteria 

In our proposed model, we aim to select the optimal site for 
building a new hospital. Various factors have been involved in 
the selection process including the following: 

1) Land use. 

2) Distance to existing hospitals. 

3) Intersection with main roads. 

4) Distance to dumping sites. 

5) Distance to industrial areas. 

6) Elevation. 

These factors are divided into three main types: 

1- Technical factors: these factors have a clear impact on 
the construction process and they include the 
elevation, the slope, the distance to existing hospitals and 
the land use of the proposed site. The land use refers 
to how the land is used by humans, while the 
distance to existing hospitals is how far the new 
hospital is from other hospitals in the same city. 

2- Environmental factors: there is a strong relationship 
between humans and the environment. The main 
environmental concerns that may affect hospital site 
selection are noise and pollution. Thus, the new 
hospital should be away from noisy and polluted 
areas such as industrial areas and dumping sites. 

3- Socio-economic factors: these factors mainly 
include proximity to transportation and main roads. 

D. Assign Criteria Weights (AHP) 

In this step, we assign weights and scores to the factors 
identified in the previous step based on AHP. AHP has been 
widely exploited in health-care and medical related problems. 
The following steps are used to assign weights to all identified 
factors based on AHP: 

1- Lay out and expose the overall factors. 

2- All factors are compared using pairwise 
comparisons in order to generate weights for the factors, 
based on questionnaires distributed to experts. In pairwise 
comparisons, we decide which factor is more important 
than another, and by how much, using a 1-5 scale as 
shown in Table 2 [13]. The produced weights quantify 
the importance of factors in the analysis and decision 
making process. 

3- Check the consistency ratios of all pair-wise 
comparisons. 

In this step, we use the Consistency Index (CI) and 
Consistency Ratio (CR) formulas to check the 
consistency as follows: 

$$CI = \frac{\lambda_{max} - n}{n - 1} \qquad (1)$$

Where: 

n: the number of criteria. 

$\lambda_{max}$: the biggest eigenvalue of the comparison matrix. 

$$CR = \frac{CI}{RI} \qquad (2)$$

Where: 

RI: a constant corresponding to the mean random 
consistency index value based on n. 

4- The relative scores are aggregated using geometric 
mean method. 
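
The weighting and consistency check in steps 2 and 3 can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from our questionnaires: the 3 x 3 pairwise matrix is hypothetical, the weights are approximated with the row geometric-mean method, the largest eigenvalue is estimated from the weighted row sums, and the RI constants are the values commonly tabulated for AHP. A CR below 0.1 is conventionally treated as acceptable.

```python
import math

# Random consistency index (RI) values commonly tabulated for AHP, keyed by n.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24,
                7: 1.32, 8: 1.41, 9: 1.45, 10: 1.49}


def ahp_weights(matrix):
    """Approximate AHP weights with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency(matrix, weights):
    """Return (lambda_max, CI, CR) following Equations (1) and (2)."""
    n = len(matrix)
    # Averaging (A w)_i / w_i over the rows estimates the largest eigenvalue.
    aw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(aw[i] / weights[i] for i in range(n)) / n
    ci = (lambda_max - n) / (n - 1)
    ri = RANDOM_INDEX[n]
    cr = ci / ri if ri > 0 else 0.0
    return lambda_max, ci, cr


# Hypothetical expert judgments on the 1-5 scale for three criteria.
pairwise = [
    [1.0, 3.0, 5.0],
    [1 / 3, 1.0, 3.0],
    [1 / 5, 1 / 3, 1.0],
]
weights = ahp_weights(pairwise)
lambda_max, ci, cr = consistency(pairwise, weights)
print([round(w, 3) for w in weights], round(cr, 3))
```

The geometric-mean approximation is used here only because it needs no linear algebra library; an exact eigenvector computation would serve equally well.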


Table 2: Pairwise comparison scale 

Verbal judgment         Explanation                                           Number 
Extremely Unimportant   A criterion is strongly inferior to another           1/5 
Moderately Unimportant  A criterion is slightly inferior to another           1/3 
Equally Important       Two factors contribute equally                        1 
Moderately Important    Judgment slightly favors one criterion over another   3 
Extremely Important     Judgment strongly favors one criterion over another   5 


E. Data Analysis 

In this step, a model is developed in order to identify the 
optimal location for building a new hospital. In this model, the 
raw data should have the same spatial reference, and they are 
converted into rasters with the same cell size, making them 
easier to reclassify in the analysis steps. 

The data analysis steps and tools are detailed as follows. 

• Distance to existing hospitals based on network 
analysis: as detailed earlier, there are six factors 
taken into account for building our model. In this 
step, we derive a series of polygons (service areas) 
that represent the distance that is required to reach 
each hospital. As a prerequisite to finding the service 
areas, we need to construct a network dataset. The 
results of applying this step are depicted in Figures 4 
and 5. 



Fig. 5: Distance to hospitals based on Service Area polygons. 


• Euclidean distance: In this step, we derive the 
Euclidean distance from facilities (dumping sites and 
industrial areas) to each pixel in the generated output 
raster. The formula for finding the Euclidean distance 
is given below: 

$$d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \qquad (3)$$

Where: 

$p = (p_1, p_2, \ldots, p_n)$ and $q = (q_1, q_2, \ldots, q_n)$ are two 
points in Euclidean n-space. 
d: the distance from p to q. 

The results of applying this step are depicted in 
Figures 6 and 7. 
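
To make Equation (3) concrete for the two-dimensional raster case, the following sketch computes, for every cell, the distance to the nearest source cell. The grid size, the source positions and the 30 m cell size are hypothetical, and the brute-force search is an illustration only; GIS tools typically use more efficient distance algorithms.

```python
import math


def euclidean_distance_raster(rows, cols, sources, cell_size=1.0):
    """Distance from every cell to its nearest source cell, in map units."""
    raster = []
    for r in range(rows):
        row = []
        for c in range(cols):
            # Equation (3) with n = 2, minimised over all source cells.
            nearest = min(math.hypot(r - sr, c - sc) for sr, sc in sources)
            row.append(nearest * cell_size)
        raster.append(row)
    return raster


# Hypothetical 4 x 5 grid with two source cells (e.g. dumping sites).
sources = [(0, 0), (3, 4)]
distance = euclidean_distance_raster(4, 5, sources, cell_size=30.0)
for row in distance:
    print([round(v, 1) for v in row])
```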



• Feature to raster: During this step, we convert the land 
use feature class (vector data) to a raster that has the 
same cell size as the derived raster layers from the 
previous step. Accordingly, we can use all of them for 
further processing. 



Fig.7: Euclidean distance to industrial areas 


• Slope: In order to build a hospital, the land should be 
relatively flat. Therefore, we consider the slope of the 
land in our model by deriving the slope of the 
elevation dataset as shown in Figure 8. By this, we 
mean that the rate of maximum change in elevations 
is calculated. 

• Reclassification: Each cell in the study area now has 
a value for each of the following factors (existing hospitals, 
dumping sites, industrial areas, land use and 
elevation). We should combine the derived datasets 
in order to identify the potential location for building 
a new hospital in the next step (Weighted overlay). 
However, we cannot combine them in their current 
form. For example, it is meaningless to combine 
cell values that represent a 15-degree slope with cell values 
that represent an agricultural land use code (6). 
Accordingly, to combine the datasets, we need to derive 
a common measurement scale, such as from 1 to 10. 
This scale identifies how suitable a specific location is 
for building a new hospital. Lower values indicate 
locations that are more suitable. The reclassification 
process is depicted in Figure 9. 



Fig. 8: Slope of Jenin governorate 


Fig. 10: Weighted overlay 

Weighted overlay table, land-use raster: scale value per land-use class 
(class names appear truncated in the figure). 

Land-use class           Scale value 
Public Buildings a       3 
Residence "B"            3 
Plant and Animal F       5 
Industrial Zone          6 
Commercial "Long         5 
Residence "A"            4 
Cemetery                 4 
Residence "C"            Restricted 
Roads Pedestria          9 
Residence "Down          6 
Residence "D"            8 
Agricultural Zone        2 
Archeological Ca         5 
Approved Road            8 
Commercial "Main         6 
Industrial "Artificial   8 



Fig.9: Classification of Euclidean distance values to classes from 1 
to 10 


As shown in figure 9, we classify the produced distance values 
from the Euclidean distance process into 10 classes by dividing 
the produced ranges into equal intervals. 
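
The equal-interval classification described above can be sketched as follows. The distance values are hypothetical, and the `invert` option is our assumption for factors where a larger distance is preferable, so that the farthest cells receive the lowest (most suitable) class.

```python
def reclassify_equal_interval(raster, classes=10, invert=False):
    """Map raw cell values to classes 1..classes using equal-width intervals."""
    values = [v for row in raster for v in row]
    low, high = min(values), max(values)
    width = (high - low) / classes
    result = []
    for row in raster:
        out = []
        for v in row:
            if width == 0:
                cls = 1
            else:
                # The maximum value is kept in the last class.
                cls = min(int((v - low) / width) + 1, classes)
            out.append(classes + 1 - cls if invert else cls)
        result.append(out)
    return result


# Hypothetical distances (metres) to a dumping site.
distance = [[0.0, 100.0, 250.0], [500.0, 750.0, 1000.0]]
reclassified = reclassify_equal_interval(distance, classes=10, invert=True)
print(reclassified)
```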

• Weighted overlay: Using this technique, we weight 
the values of each dataset by assigning each a 
percentage of influence. The higher the percentage, 
the more influence an input has in the suitability 
model. Some input values are restricted. For 
example, areas that belong to the “C” administrative 
division of the West Bank are restricted, as shown in 
Figure 10. The result of this step indicates how 
suitable each location is for building a new hospital. 


• Majority filter: The size of the suitable area is an 
important criterion in identifying the optimal site for 
building a new hospital. Thus, we use this tool to 
ensure that the number of neighboring cells with a 
similar value is large enough to build a new 
hospital. 

• Condition: Pixels with values less than 3 indicate 
suitable locations. Thus, the condition tool is used to 
identify those locations. 
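
The last three steps (weighted overlay, majority filter and condition) can be sketched together on a toy grid. The functions below are simplified, hypothetical stand-ins and not the GIS tools themselves: the layers and weights are invented, restricted cells are marked with `None`, and the majority filter uses a plain eight-neighbour strict-majority rule.

```python
from collections import Counter

RESTRICTED = None  # marker for cells excluded from the analysis


def weighted_overlay(layers, weights):
    """Combine reclassified layers cell by cell; restricted cells stay restricted."""
    rows, cols = len(layers[0]), len(layers[0][0])
    result = []
    for r in range(rows):
        out = []
        for c in range(cols):
            cells = [layer[r][c] for layer in layers]
            if any(v is RESTRICTED for v in cells):
                out.append(RESTRICTED)
            else:
                out.append(round(sum(w * v for w, v in zip(weights, cells))))
        result.append(out)
    return result


def majority_filter(raster):
    """Replace a cell when a strict majority of its valid neighbours share a value."""
    rows, cols = len(raster), len(raster[0])
    result = [row[:] for row in raster]
    for r in range(rows):
        for c in range(cols):
            if raster[r][c] is RESTRICTED:
                continue
            neighbours = [
                raster[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
                if (rr, cc) != (r, c) and raster[rr][cc] is not RESTRICTED
            ]
            if not neighbours:
                continue
            value, count = Counter(neighbours).most_common(1)[0]
            if count > len(neighbours) / 2:
                result[r][c] = value
    return result


def condition(raster, threshold=3):
    """Flag cells whose suitability value is below the threshold."""
    return [[v is not RESTRICTED and v < threshold for v in row] for row in raster]


# Hypothetical reclassified layers (1 = most suitable) with equal influence.
land_use = [[2, 2, 2], [2, 8, 2], [2, 2, RESTRICTED]]
roads = [[2, 2, 2], [2, 2, 2], [2, 2, 2]]
overlay = weighted_overlay([land_use, roads], [0.5, 0.5])
smoothed = majority_filter(overlay)
suitable = condition(smoothed, threshold=3)
print(overlay, smoothed, suitable, sep="\n")
```

In this toy run the isolated unsuitable cell in the centre is absorbed by its neighbours, while the restricted cell is never reported as suitable.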


V. Experimental Results and Discussion 

This section describes the experiments carried out to evaluate 
our proposed suitability model. A prototype of the proposed 
model is implemented and experiments are conducted using a 
PC with dual-core CPU (2.1GHz) and (8 GB) RAM. The used 
operating system is Windows 10. 

• Comparing the Produced Results When Changing 
the Weights Assigned for all Input Data Sets 

In this section, we compare the results produced by 
the proposed model when we change the weights assigned to 
the input data sets according to the following table. 

Table 3: The weights assigned for input datasets in 
different experiments. 

Input data set                  Random weights   Random weights   AHP weights 
                                in EXP.1         in EXP.2         in EXP.3 
Land use                        0.09             0.10             0.22 
Distance to existing hospitals  0.04             0.20             0.13 
Near main roads                 0.24             0.04             0.30 
Distance to dumping sites       0.13             0.06             0.22 
Distance to industrial areas    0.30             0.30             0.02 
Elevation                       0.20             0.30             0.11 


As shown in Table 3, we assigned different weights to each 
input data set in different experiments. After running the 
proposed model with weights from the first experiment 
(EXP.1) and the second experiment (EXP.2), we get the 
results shown in Figure 12. 


Fig. 12: The optimal sites for building a new hospital with weights from 
EXP.1 and EXP.2 respectively (panel titles: The best site for building a 
new hospital EXP.1; The best site for building a new hospital EXP.2). 


As we can see in Figure 12, the produced results have various 
contiguous alternatives, which may be confusing for the 
decision maker. This is because the weights are randomly 
assigned for the input datasets without extensive care or 
research. 


However, we were able to achieve promising results for 
decision makers when using the weights produced from 
applying AHP as shown in Figure 13. 



VI. Conclusion and future work 


In this paper, we built a suitability model for selecting the 
optimal site for building a new hospital by coupling 
GIS-based MCA and AHP. GIS tools and techniques are 
employed to analyze the list of identified criteria in hospital 
site selection. The analysis process incorporates assigning 
weights to the identified criteria based on AHP. At the 
end of the analysis process, the optimal site for building a new 
hospital is identified. The results showed that assigning 
weights based on AHP is better than assigning weights 
randomly for the set of identified criteria. 


Abstract- Over the years, existing systems and 
applications have worked seamlessly with relational 
databases. Applications such as point of sale, 
hospital management systems, human resource 
applications, payroll systems and banking systems, 
just to mention a few, are powered by relational 
databases with minimal to no issues, because the 
relationships amongst entities in these systems are 
mostly not complicated or highly connected. In the 
same vein, relational database systems have been 
able to handle large amounts of data and 
transactions emanating from the everyday operations 
of these systems. In summary, the relational 
database was very effective in dealing with the 
problems that the available applications were 
solving. In modern times, the influx of social media 
platforms, map and navigation systems, geospatial 
information systems, recommendation engines, 
referral systems and the like has turned the tide: 
systems and the databases behind them must now 
support, manage and model mostly semi-structured or 
unstructured, connected data and the complex 
associations amongst data elements. This study 
looked at such data through the concept of 
recommendations to test whether relational 
databases still perform well with the current trends 
in connected data, or whether the NoSQL paradigm 
making inroads in the technology space has a point 
when its proponents say “they are the database 
paradigm for the future”. An experiment was 
therefore performed with a relational and a graph 
database to test this claim. 

Key performance indices (KPIs) such as runtime, 
storage, schema flexibility, query complexity and 
general operations of each database paradigm 
were tested with the concept of making 
recommendations to a number of people in the 
database based on how they are associated (as 
friends) with each other. Each paradigm was put 
through similar tests against the above KPIs, and the 
graph database seemed to have an edge over the 
relational database in the results and analysis of the 
figures obtained from the experiment. 
Keywords: Key Performance Index, Graph 
Databases, NoSQL, Relational Databases, DBMS 

I. INTRODUCTION 
The battle between relational [9] [11] and 
non-relational databases began a while ago, but 
current technological and software application 
trends, and those anticipated for the future, have 
intensified this battle. Applications now thrive on 
information about how stored data is connected and 
associated, for further analysis, decision making and 
recommendations. A direct example is the social 
media platforms available nowadays; hence the 
database management systems behind these systems 
have to be in a position to meet these ever changing 
trends and requirements [6]. Each database paradigm 
(relational and non-relational) has dealt with this 
issue in its own peculiar way, which may or may not 
have worked in certain circumstances. So the 
question still remains which of these database 
systems better handles connected, associated data 
for applications that need to interpret this data for 
analysis and decision making in a fast and efficient 
way, using less processing power, memory and 
storage space, and with less complex queries [1]. The 
problem most often stems from when the connected 
and associated data being used by these applications 
(social media platforms, map and navigation 
systems, geospatial information systems, 
recommendation engines (mostly behind 
e-commerce platforms), referral systems, Internet of 
Things) starts growing, and complex questions and 
analyses are requested from the applications to drive 
short term to long term strategic decision making for 
organizations. It is in such situations that the ACID 
[9] (atomicity, consistency, isolation and durability) 
features of a database are really tested. So the 
question still remains which type of database 
system (relational or non-relational) is for the 
future in terms of big and connected data, which 
seems to be the bane of applications coming up each 
single day. 

This study therefore aims to test how each of the 
database paradigms will behave in terms of the 
parameters noted below. 

• General operational differences 

• Storage analysis on both DBMS 

• Runtime analysis of both DBMS 
interacting with associative data 

• Query complexity analysis 

• Handling of schemas for connected data 
sets. 

II. JUSTIFICATION OF THE STUDY 

The purpose of this study is to test and profile a 
relational [9] [11] and a graph database [16] working 
behind a recommender application which analyzes 
the associations between data constituents 
for decision making and strategic planning. In other 
words, scenarios will be set up whereby people 
having relations with others in the database are 
suggested products that friends have bought or liked, 
using relational and graph queries [1][11]. In a 
nutshell, future applications which depend 
mostly on strong analysis of connected data can be 
built knowing which database paradigm will support 
and answer the complex questions that will be posed 
to such applications. 

On a higher level, application throughput will be 
tested as the query time for complex 
questions/situations which are solved by such 
applications tracing and analyzing patterns in user 
associations in the data presented, to predict future 
behavior or preferences. The amount of physical 
storage used by each DBMS (relational and graph 
database) as data grows in such situations will also 
be a question to be answered, as well as how fast the 
application will run when data increases. In other 
words, which database paradigm will survive with 
large amounts of data with complex associations? 
Moreover, a test will also be made of whether the 
combination of the two databases backing the same 
application may or may not help solve complex 
situations better. This will be based on 
leveraging the strengths of each database uncovered 
in the course of the study to mitigate the shortfalls 
of the other. 

III. LITERATURE REVIEW 

Though the relational database concept proposed by 
E.F. Codd [9] has thrived over the years and has stood 
the test of time, giving it a merited status as a mature 
database system [6], the recent introduction of 
systems like the World Wide Web [15], social media, 
social networking applications [7], the Internet of 
Things and the like has put this maturity to the test, 
raising questions in the database domain about 
flexibility when it comes to the data sets used in 
these modern applications [8]. This trend has caused 
the emergence of new database paradigms like 
NoSQL [14], which tries to mitigate the shortfalls of 
relational databases in this regard and has many 
variants, from column stores to key-value database 
systems. This study looks at the graph database in 
relation to the relational database on how they 
handle related data objects for systems that need 
them for decision making. Related studies like the 
one done by Emil Eifrem [4] [12] compared graph 
and relational databases on a social example of 1000 
customers/users with an average of 50 friend 
connections, with degrees of connection giving 
friend-of-friend, friend-of-friend-of-friend, down to 
4th hop scenarios. Queries run against each database 
produced an execution time of 2 ms for the graph 
database and 2000 ms for the relational database for 
a small dataset. The scenario got worse for the 
relational database, since it had to be stopped after 
a day of running the same query which the graph 
database ran in 2 ms. From that study, the graph 
database proved better in that regard. 

A similar study by Sharma M. & Soni P. [1] analyzed 
the performance of both relational and graph 
databases with some predefined data processing 
queries based on a schema, measuring data 
processing time with various numbers of input data 
objects. The schema used was modeled around user, 
friend, movie and actor entities, with queries that 
get all friends of a user, the favorite movies of a 
user’s friends and the lead actors of the movies 
watched by a user’s friends. The study ran these 
three queries against increasing datasets of 100, 200 
and 300, with the relational database having a 
greater runtime than the graph database in each of 
the three instances. 

B. Shalini & Tyagi C [6] evaluate graph (Neo4j) and 
relational (MySQL) databases based on parameters 
like level of support/maturity, security of the 
database and database flexibility. In terms of 
maturity, an argument is made for the relational 
database for providing storage and robust support 
for commercial applications or products for over a 
decade, as compared to graph databases, which came 
into the technology limelight in the early 2000s and 
so may not be on the same wavelength as relational 
databases when it comes to production testing over 
a long period of time. From the security viewpoint, 
the relational database, with its built-in multiuser 
support and restrictions, is in better shape than the 
graph database because of its comprehensive support 
for Access Control Lists (ACL). Nevertheless, the 
graph database takes the lead when it comes to 
flexibility, because of the shortfall of relational 
databases in extending schemas or databases [10]. 
The relational database also lags when it comes to 
the management of flexible schemas that change 
over time [13]. As most studies seem to target just 
the runtime of queries, this study also looks at that 
and extends the parameters to add storage analysis, 
compare schemas and investigate query 
complexities. 


IV. METHODOLOGY 

For the analysis of connected data, one has to know 
what connected data is. It is mainly data that has 
individual entities interconnected with each other, 
such that decisions and analysis are made based on 
the connected relationships between the 
entities [6] [12]. The focus of this study is on such 
connected data and how it is analyzed by graph and 
relational databases, taking into consideration the 
size of the data or rows involved, from thousands of 
records all the way up to a million interconnected 
records. 

This study is centered around a social 
relationship [7] amongst friends, based on which 
decisions or recommendations will be made, to 
highlight the base concept of connected data. The 
recommendation technique used in this study to 
inform the composition of graph and relational 
database queries is Collaborative Filtering (CF). 
CF, also referred to as social filtering, filters 
information using the recommendations of other 
people (mostly friends and acquaintances). For 
example, a person who wants to watch a movie may 
ask for recommendations from friends. The 
recommendations of friends who have similar 
interests are trusted more than recommendations 
from others [18]. The data used for the study have 
been modelled around the CF concept to ascertain 
which of the database paradigms works well in 
situations where connections among data entities are 
used to make recommendations. The data can come 
from a wide variety of products from a hypothetical 
e-commerce site, but the focus is on users purchasing 
movies, books, games and electronic gadgets. 
Figure 1 shows the underlying general data model. 



Figure 1: The general data model irrespective of the 
type of database used. 

The model in Figure 1 projects four entities (Person, 
Friend, Item, and Transaction) for the study. This 
data model portrays people and their friends and how 
they relate to the items in a hypothetical online store 
from which recommendations will be made by the 
system using CF. A person buys or likes one or more 
items, be it a movie, book, game or electronic 
gadget. A person also has one or more friends who 
have liked or bought items as well. People can have 
a lot of friends and buy or like a lot of items, so we 
have large amounts of data with entities associating 
or relating to each other. The question is how we 
leverage the features of a relational or graph 
database to find patterns in the data that can help 
recommend or suggest items and even friends to 
people. 
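
Independently of any database, the CF idea described above can be sketched with plain in-memory dictionaries. The names, items and the ranking rule (number of friends who bought or liked an item) are illustrative assumptions and not the queries used in the experiment.

```python
from collections import Counter


def recommend(person, friends, purchases, top_n=3):
    """Rank items bought or liked by a person's friends that the person lacks."""
    owned = purchases.get(person, set())
    counts = Counter()
    for friend in friends.get(person, set()):
        for item in purchases.get(friend, set()):
            if item not in owned:
                counts[item] += 1
    # Sort by number of friends (descending), then by name for a stable order.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical friendships and purchases.
friends = {"alice": {"bob", "carol"}, "bob": {"alice"}, "carol": {"alice"}}
purchases = {
    "alice": {"book_a"},
    "bob": {"book_a", "movie_x", "game_y"},
    "carol": {"movie_x", "gadget_z"},
}
suggestions = recommend("alice", friends, purchases)
print(suggestions)
```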



Figure 2: Relational Database Model 


Figure 2 shows the relational data model for the 
system. This shows the relationships the person 
entity has with the other entities through the foreign 
key constraint. This further elaborates that a person 
has friends and can buy or like an item which can be 
a book, game, movie or a gadget etc. In the relational 
database paradigm [9] this relationship is shown 
with a foreign key linking the person entity to the 
other entities. This is how relationships are 
portrayed in the relational sense showing 
connectivity amongst individual entities. 
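
The foreign-key modelling described above can be illustrated with a small sketch. It uses Python's built-in SQLite module as a stand-in for MySQL, and the table names, column names and sample rows are hypothetical rather than the schema of Figure 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE friend (
    person_id INTEGER NOT NULL REFERENCES person(id),
    friend_id INTEGER NOT NULL REFERENCES person(id)
);
CREATE TABLE transactions (
    person_id INTEGER NOT NULL REFERENCES person(id),
    item_id INTEGER NOT NULL REFERENCES item(id)
);
""")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol")])
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 [(10, "book_a"), (11, "movie_x")])
conn.executemany("INSERT INTO friend VALUES (?, ?)", [(1, 2), (1, 3)])
conn.executemany("INSERT INTO transactions VALUES (?, ?)",
                 [(2, 10), (2, 11), (3, 11)])


def friend_items(connection, person_id):
    """Items bought by a person's friends, with the number of buying friends."""
    return connection.execute(
        """
        SELECT i.title, COUNT(*) AS buyers
        FROM friend f
        JOIN transactions t ON t.person_id = f.friend_id
        JOIN item i ON i.id = t.item_id
        WHERE f.person_id = ?
        GROUP BY i.title
        ORDER BY buyers DESC, i.title
        """,
        (person_id,),
    ).fetchall()


result = friend_items(conn, 1)
print(result)
```

Each additional hop (friends of friends, and so on) would need a further self-join on the `friend` table, which is one way the query complexity compared in this study grows.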



Figure 3: Graph Database Model 


Figure 3 above depicts how data modeled as a 
graph [2] [5] [10] looks. In the graph sense, each 
entity is represented by a node, with its properties 
being the attributes of that node. The relationships 
between nodes are represented by the labeled 
edges between them [16]. 
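
The node, property and labeled-relationship structure described above can be mimicked in a few lines, together with a breadth-first traversal limited to a number of hops. This is a plain Python illustration and not the Neo4j API; the node identifiers and the relationship labels FRIEND_OF and BOUGHT are hypothetical, and edges are treated as directed.

```python
from collections import deque

# Nodes carry properties; relationships are labeled, directed edges.
nodes = {
    "p1": {"label": "Person", "name": "alice"},
    "p2": {"label": "Person", "name": "bob"},
    "p3": {"label": "Person", "name": "carol"},
    "p4": {"label": "Person", "name": "dave"},
    "i1": {"label": "Item", "title": "book_a"},
    "i2": {"label": "Item", "title": "movie_x"},
    "i3": {"label": "Item", "title": "game_y"},
}
edges = [
    ("p1", "FRIEND_OF", "p2"),
    ("p2", "FRIEND_OF", "p3"),
    ("p3", "FRIEND_OF", "p4"),
    ("p2", "BOUGHT", "i1"),
    ("p3", "BOUGHT", "i2"),
    ("p4", "BOUGHT", "i3"),
]


def neighbours(node_id, label):
    """Targets of all edges with the given label that start at node_id."""
    return [dst for src, lbl, dst in edges if src == node_id and lbl == label]


def friends_within(start, max_hops):
    """Breadth-first traversal over FRIEND_OF edges; returns hop count per person."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if seen[current] == max_hops:
            continue
        for nxt in neighbours(current, "FRIEND_OF"):
            if nxt not in seen:
                seen[nxt] = seen[current] + 1
                queue.append(nxt)
    del seen[start]
    return seen


def items_of(people):
    """Sorted titles of the items bought by the given people."""
    return sorted({nodes[i]["title"] for p in people for i in neighbours(p, "BOUGHT")})


reachable = friends_within("p1", max_hops=2)
titles = items_of(reachable)
print(reachable, titles)
```

Changing `max_hops` extends the question from friends to friends of friends and beyond without changing the shape of the traversal.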

This study looks at how these relationships between 
people, their friends and products can be leveraged 
to answer questions like “What are the products a 
person’s friends are buying?” or “What products are 
a person’s friends of friends buying?”, which is the 
basis of recommendation systems that recommend 
products to people based on their relationships and 
the buying patterns of friends, friends of friends, etc. 

The system implementation in code was based on 
the architecture below. The system comprises a 
Java Enterprise Edition application running on a 
graph (Neo4j) and a relational (MySQL) database. 



Figure 4: General System Architecture 


From Figure 4, the system comprises: 

I. Application Server: a Java EE application 
deployed and running on a GlassFish server, where 
the business logic for the application resides. This is 
where all decisions are made using the information 
obtained from the attached databases (MySQL and 
Neo4j). 

II. Neo4j [16] [13] Graph Database: the graph DBMS 
that interfaces with the graph implementation side of 
the application. Neo4j can be run in server mode or 
embedded mode. For this study it is used in 
embedded mode, meaning that the management of 
the database is embedded in code using the Neo4j 
libraries; wherever the code is deployed, an instance 
of the Neo4j database is created. 

III. MySQL: the relational DBMS that interfaces 
with the relational implementation side of the 
application. 

The system also works with the functional 
requirements below: 

I. The system works with a graph and a relational 
database to help make product suggestions to people 
based on the pattern of purchases or likes of other 
people and friends (Collaborative Filtering) [14]. 

II. Based on the behavior in (I) above, each database 
will be populated and tested with increasing 
amounts of data (e.g. 10000 people, 100000 people 
or 1000000 people) to answer questions like “What 
have friends of a person bought or liked?” so that 
recommendations can be made for that person. 

III. The runtime, the amount of storage used and the 
complexity of the queries of each database system 
will be analyzed, based on the data and questions 
posed to the system in (II) above. 

Any system on which the application is deployed 
must satisfy the requirements below. 

Software: 

• Java JDK 7 and above 

• MySQL RDBMS 

• Neo4j run in embedded mode 

• GlassFish version 4 and above 

Hardware: the system will be tested on a computer 
with the specification below 

• Intel Core i3 processor 

• 8GB RAM 

• 500GB hard disk space 

• OS Linux Ubuntu 16.04 
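
The runtime analysis relies on the average runtime of a repeated query. A minimal sketch of such a measurement is given below; the helper name, the number of repetitions and the placeholder workload are assumptions for illustration, since the actual measurements were taken inside the Java application.

```python
import time


def average_runtime(query, repetitions=5):
    """Run a callable several times and return the mean wall-clock time in seconds."""
    timings = []
    for _ in range(repetitions):
        start = time.perf_counter()
        query()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def sample_query():
    """Placeholder workload standing in for a database query."""
    return sum(range(10_000))


mean_seconds = average_runtime(sample_query, repetitions=3)
print(f"{mean_seconds:.6f} s")
```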

V. RESULTS 

For the runtime analysis, the same query was run on 
datasets ranging from a hundred transactions all the 
way to a million, and the average query runtime was 
calculated in each scenario. The results below were 
obtained for the relational and graph databases: 

Table 1: Runtime against data results for the 
relational database 

Data (records)   Execution time (secs) 
100              0.0028 
1000             0.016 
10000            0.164 
100000           1.315 
500000           6.244 
1000000          12.896 


Figure 5: Chart showing amount of data run in query 
against runtime for a relational database (x-axis: data/records) 


Table 2: Runtime against data size for the graph 
database 

data (records)    Execution time (secs) 
100               0.002 
1000              0.006 
10000             0.09 
100000            0.219 
500000            1.616 
1000000           1.728 

Figure 6: Chart showing amount of data run in query 
against runtime for a graph database 
(x-axis: data/records) 

Figure 7 shows the runtime analysis trend of the 
relational and graph databases as data increases. 

Figure 7: Runtime analysis trend (series: Relational DB 
execution time/secs, Graph execution time/secs) 

The same analysis is done with the storage space 
used by each database, shown in figure 10. 

Table 3: Storage against data size for the 
relational database 

data (records)    Storage space (kbs) 
100               792 
1000              840 
10000             2744 

Figure 8. Chart showing database storage size 
against record size all in kilobytes 


Table 4: Storage against data size for the graph 
database 

data (records)    Storage (kbs) 
100               400 
1000              2000 
10000             18200 
100000            105100 
500000            535000 
1000000           839000 



Figure 9: Chart showing database storage size 
against record size all in kilobytes 

Combined Storage Analysis 

Figure 10 shows the storage analysis trend of the 
relational and graph databases as data increases. 

Figure 10: Storage space used by the graph and 
relational databases (x-axis: data/records; series: 
Relational storage/kbs, Graph DB storage/kbs) 


Looking at the trends in figures 7 and 9, it can be 
seen from the runtime trends that, as the number of 
records increases, the two database systems behave 
differently. For the relational database, runtime 
rises steeply once the data size reaches 100000 
records, reaching about 12.9 secs with a million 
records, which is worrying in terms of system 
performance with large datasets. The graph database, 
on the other hand, maintains a modest increase in 
runtime beyond 100000 records and stays under 
2 secs even with a million records, making it a good 
option for systems that analyze large, connected 
data sets. The story is different when it comes to 
the amount of space needed by each database to 
handle large sets of data. The graph database uses a 
large amount of space as the dataset gets larger 
(about 839000 kbs for a million records), while the 
relational database uses less than 60MB to handle a 
million records. Even though this is alarming on 
the part of the graph database, storage does not 
seem to be a problem for current systems and 
applications, because most servers or machines 
come with not less than 500GB. That notwithstanding, 
storage optimization is a big plus for the relational 
database. 

When we assess the complexity of the queries on both 
sides, we get to know why the relational database 
has a high runtime. From the query complexity 
analysis it is seen that, when it comes to analyzing 
connected data, the relational database often uses 
joins to associate entities. This presents a 
complexity of O(n^2) when join queries are executed 
as nested loops, while graph queries have a 
complexity of O(log n) due to the traversal of the 
graph data structure. This explains the runtime 
patterns in figure 7. 
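
The contrast between a nested-loop join and a graph traversal can be illustrated with a small sketch. It is not the code used in this study, and it does not reproduce the complexity figures quoted above (a real RDBMS may also use indexes). It only counts the work done by each style of lookup on hypothetical in-memory data.

```python
def friends_purchases_nested_loop(friend_rows, purchase_rows, person):
    """Join-style lookup: scan every friend row and, for each matching
    row, scan every purchase row."""
    comparisons = 0
    products = set()
    for owner, friend in friend_rows:
        comparisons += 1
        if owner != person:
            continue
        for buyer, product in purchase_rows:
            comparisons += 1
            if buyer == friend:
                products.add(product)
    return products, comparisons


def friends_purchases_traversal(friends_of, bought_by, person):
    """Graph-style lookup: follow adjacency lists out of the start node."""
    steps = 0
    products = set()
    for friend in friends_of.get(person, []):
        steps += 1
        for product in bought_by.get(friend, []):
            steps += 1
            products.add(product)
    return products, steps


# Hypothetical sample data (illustrative only).
friend_rows = [(1, 2), (1, 3), (2, 3), (3, 4)]
purchase_rows = [(2, "laptop"), (3, "camera"), (3, "laptop"), (4, "phone")]

# The same data stored as adjacency lists, as a graph store would keep it.
friends_of, bought_by = {}, {}
for owner, friend in friend_rows:
    friends_of.setdefault(owner, []).append(friend)
for buyer, product in purchase_rows:
    bought_by.setdefault(buyer, []).append(product)

join_result, join_cost = friends_purchases_nested_loop(friend_rows, purchase_rows, 1)
graph_result, graph_cost = friends_purchases_traversal(friends_of, bought_by, 1)
print(join_result == graph_result, join_cost, graph_cost)
```

Both functions return the same products, but the nested-loop version touches rows that have nothing to do with the person being queried, while the traversal only touches that person's neighbourhood.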
VI. CONCLUSION 

The study analyzed the runtime of queries for the 
two database paradigms, followed by the storage 
space used as the size of the data grows. It also 
looked at how relational and graph databases model 
connected data in the quest to create a database 
structure that fulfils the objective at hand. 

Based on the analysis done, it can be concluded 
that when it comes to dealing with big and connected 
data the graph database has an edge over the 
relational database. This makes the graph database a 
great option for future applications that will 
make decisions based on complex data associations, 
and gives cause to worry about how far the relational 
paradigm can carry us. Relational databases were 
conceived to digitize paper forms and automate 
well-structured business processes, and still have 
their uses. But an RDBMS cannot model or store 
connected data and its relationships without 
complexity, which means performance degrades 
with the increasing number and levels of data 
relationships and data size. Additionally, new types 
of data and data relationships require schema 
redesign that increases time to market. 
Abstract: Autism Spectrum Disorder (ASD) is a clinically heterogeneous neurological developmental 
disorder. It is called a spectrum disorder because of its range of symptoms. Early diagnosis and proper 
intervention are required for the effective treatment of autism. Diagnosis is based on the quantitative 
and qualitative analysis made by the clinician, so the expertise of the clinician is important in the 
proper diagnosis and classification of autism. This paper proposes an expert system that acts as a 
support system to the clinician. Major clinical attributes of autism along with facial features are used 
as input to the expert system. The main highlight is the use of features from 3D facial imagery for 
autism classification. The expert system operates in two modes, diagnosis mode and grading mode. 
A Naive Bayes classifier is initially used for the diagnosis mode, whereas the overall system is implemented 
using a Neuro-Fuzzy approach. In the diagnosis mode 100% accuracy and in the grading (classification) mode 
98.8% accuracy is obtained. 

Keywords: Autism, ASD, 3D Face, Neuro-Fuzzy System, Neural networks, Fuzzy logic, Expert 
system. 

Introduction: Autism spectrum disorder (ASD) is a clinically heterogeneous condition with a wide 
range of factors. A satisfactory diagnostic measure for ASD is currently unavailable. Autism is a 
neurological handicap in children, which is usually diagnosed in early childhood. There is a lack of 
definitive biomarkers for autism diagnosis, so the diagnosis mostly depends on a range of factors. People 
with autism show different clinical features and symptoms. There is a lot of scope for quantitative 
research on ASD in developing countries like India, where satisfactory and accurate data for research in 
autism is unavailable. The frequency of ASD diagnosis is increasing; factors like 
increased awareness and improved detection mainly contribute to this. The publication of DSM-5 in May 
2013 added major revisions needed to remove the confusing labels associated with ASD. The earliest 
symptom is the absence of normal behavior. All children should be screened using a standardized 
autism screening tool at 18 and 24 months of age [1]. 

Symptoms of ASD must be present in the early developmental period, mostly by the second year of 
life (after 12 months), but the least severe type of ASD may be diagnosed at 4 to 6 years or later. 
Intervention should begin as early as possible. The intervention should consider the core distinctive features of 
autism, and it should be specific and evidence based. Moreover, it should be well structured and 
appropriate to the developmental needs of the child. 
Even though there are inter-individual differences in the levels of children with ASD, they share 
some common characteristics like deficits in social interaction/communication and behavioral 
abnormalities. 

Studies show that children who have a deficit in recognizing faces in childhood show severe autistic features in their 
teenage years. Research shows that humans recognize a person by their body if someone is coming from far 
away or the face is obscured. So for identifying a person, the brain uses facial characteristics and also 
other physical cues. 

Researchers at the University of Missouri have identified facial feature measurements in children 
with autism and developed a screening tool for young children. The sample consists of children from 8 
to 12 years of age. Judith Miles, Professor Emeritus of child health-genetics in the MU Thompson 
Center for Autism and Neurodevelopmental Disorders, points out that a portion of those children 
diagnosed with autism tend to look alike, with similar facial characteristics [2]. 

In this research we are developing an expert system that uses core clinical features with their attributes, 
facial characteristics and parental status as input. 

Autism: Clinical features and Diagnosis 

Autism detection can be done by using quantitative tests and qualitative analysis. In DSM IV, ASD 
diagnosis is based on language delays, social communication problems and repetitive behavior, 
whereas in DSM V, ASD diagnosis has two criteria domains, namely the social interaction domain 
and the repetitive or restricted behavior domain. The core clinical features of autism can be brought 
under the following heads with attributes. 

1. Behavioral problem 

a) Poor eye contact 

b) Lack of responsiveness to others 

c) Difficulty in building social relationship 

d) Repetitive acts 

e) Self harm 

f) Compulsive behavior 

g) Hyper Activity 

h) Poor joint attention 

i) Solo play 

j) Excessive fear 

k) Poor emotional response 


2. Language Disorder 

a) Muteness 

b) Echolalia 

c) Sound making 


3. Intellectual retardation 

a) General intellectual retardation 

b) Brain Seizures 
4. Facial features 

a) Open Eyes 

b) Wide Mouth 

c) Large region between mouth and nose 

d) Expressionless face 

e) Open mouthed Appearance 

f) Prominent Forehead 

5. Parents Status 

a) Not Autistic 

b) Autistic 

The earliest symptom is the absence of normal behavior. Normally, when a parent or a healthcare 
provider notices any delay or abnormal behavior in the child at or prior to the age of three, they are 
prompted to consult a developmental pediatrician. The child is analyzed carefully and, if any abnormality 
is observed in the core functional areas, the developmental pediatrician recommends the child for 
an assessment test using any of the standard autism testing tools. These tools are normally a checklist or 
questionnaire containing autism features. The clinician fills in the data using his or her observation and a 
structured discussion with the parent of the child under scrutiny. After the details are filled in, a final score is 
generated. Comparing the obtained score with the threshold value, the clinician initially classifies the 
child as either not autistic or autistic. The next step is to identify which autistic class or grade the 
child belongs to. Based on the total score compared against a threshold, the child is diagnosed as mild, 
moderate or severe. For example, if the total score (S) adds up to 60 and the threshold is 30, the grades and 
remarks are as shown in table 1. 


Score             Class/Grade    Remarks 
Score <30         Normal         Typical 
Score 30 to 34    Mild           Requiring support 
Score 34 to 38    Moderate       Requiring substantial support 
Score >38         Severe         Requiring very substantial support 

Table 1: Score with Grade 
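
The mapping in Table 1 can be written as a small function. This is a minimal sketch rather than part of the expert system itself. Because the table lists 34 in both the Mild and the Moderate ranges, the sketch assumes that a score of exactly 34 is graded Moderate; the function name is hypothetical.

```python
def grade_from_score(score):
    """Map a total checklist score to (grade, remark) following Table 1.

    Assumption: the boundary score 34, which Table 1 lists under both
    Mild and Moderate, is treated as Moderate here.
    """
    if score < 30:
        return ("Normal", "Typical")
    if score < 34:
        return ("Mild", "Requiring support")
    if score <= 38:
        return ("Moderate", "Requiring substantial support")
    return ("Severe", "Requiring very substantial support")


for s in (25, 31, 36, 45):
    print(s, grade_from_score(s))
```

The sharp cut-offs in such a function are exactly where the fuzziness in the score, discussed in the next paragraph, can lead to misclassification.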


The expertise and dedication of the clinician is an important factor while analyzing the grade or class 
of autism. An expert clinician can easily spot the grade of autism. Some clinicians depend fully on the 
diagnosis tool, and there are possibilities of wrong classification. Moreover, the fuzziness in the score 
may also lead to misclassification. Studies say that a proper initial diagnosis and follow up is required 
for autism. If an expert system is used as a support system for clinicians, misclassification and 
problems in the initial diagnosis of autism can be avoided to an extent. In this research we are 
developing an expert system to assist clinicians in their diagnosis procedure. 
Silberberg et al. [3] focus on the prevalence of neuro-developmental disorders among children aged 2 
to 9 years in different areas of India. They also analyzed the risk factors associated with neuro-developmental 
disorders, along with the development of a screening and diagnosing methodology. 

An investigation related to the epidemiology of ASD in India was reported by Mukerji et al. [4]. 

Myers et al. [5] suggest that the primary goal of treatment for ASD is to maximize the child’s 
ultimate functional independence and quality of life by minimizing the core features of ASD. 

The objective of Robins et al. [6] is to validate the modified checklist for autism in toddlers. 

Yasmin H. Nuggers [7] studied the prevalence, risk factors and diagnosis of ASD in developing 
countries. The brief review also discusses controversies regarding the increase in the estimate of prevalence 
and the implications of changes in ASD definitions. 

Vijay Sagar KJ [8] focuses on the study of developmental disorders in India. He concludes his article by 
saying that there is a need for proper diagnosis and screening tools for autism in India. 

Hammond et al. [9] propose the use of dense face models in 3D analysis of facial morphology. The 
models provide a detailed visualization of 3D face shape variation, with the capability of training 
physicians to recognize the core components of particular syndromes. Ten-fold cross validation testing 
is done on the sample faces using different pattern recognition algorithms. 

Vezzetti et al. [10] highlight 3D human face descriptions, landmark measures and geometrical 
features. Analysis of facial morphology is very important in the study of facial abnormalities. 

Gupta et al. [11] worked on the assumption that different facial expressions can be considered as 
isometric deformations of facial surfaces. Even though deformation occurs, the intrinsic properties of the 
surface remain the same. 

The investigation of Aldridge et al. [12] focuses mainly on the correlation between brain development and the face. 
The brain develops in concert and coordination with the developing facial tissues. Since ASD is due to 
an alteration in the embryological brain, this suggests that there are differences between the facial structures of 
children with ASD and normally developing ones. Finally, the authors conclude that there are significant 
differences in the facial morphology of boys with ASD compared to normally developing ones. 

Weigelt et al. [13] report that face identity recognition is deficient in ASD. The deficit is both process 
specific and domain specific. They suggest that autism is a domain specific disorder. 

The objective of Ruggeri et al. [14] is to find the similarities and differences between the terms biomarker and 
endophenotype. Their study includes the established biomarkers and endophenotypes in autism 
research along with the discovery of new biomarkers. 

Dataset: The background study and data collection for this work were done at the Block Resource Centre (BRC), 
Cherthala, Kerala, India. BRC is a Government agency working along with Sarva Shiksha Abhiyan. 
The dataset consists of 47 children, which includes both boys and girls. The ratio of boys to girls is 
12:1. The age ranges from 2 years to 12 years. While studying and analyzing the dataset we made 
use of expert opinion from pediatric neurologists, developmental pediatricians, speech therapists, 
remedial educators, clinical psychologists and parents. 


https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 

International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 15, No. 12, December 2017 

Proposed system with objectives: 

Objective: Our research focuses on developing an expert system for the initial diagnosis and grading of 
childhood autism. This system can be used as a support system for clinicians while diagnosing 
autism. The proposed system has two modes of operation, Diagnosis mode and Grading mode, as 
shown in figure 1. Initially, in the diagnosis mode, the expert system predicts whether the child is 
non-autistic or autistic. Once the output of the diagnosis mode is autistic, the next phase is activated. In 
this phase a detailed analysis is done and the possible outcome is the class or grade of autism. 


Figure 1: Flow chart of the expert system. (Core features enter the Diagnosis Mode, which outputs 
Not Autistic / Autistic; for an Autistic output, sub features and face attributes enter the Grading 
Mode, which outputs Mild / Moderate / Severe.) 


Scale    Output      Remarks 
0        Normal      Non autistic 
1        Mild        Requiring support 
2        Moderate    Requiring substantial support 
3        Severe      Requiring very substantial support 

Table 2: Grading 


The core features of autism are analyzed initially during the diagnosis mode. Core features include 
behavioral problem, language disorder and general mental retardation. Based on the core features, 
the diagnosis mode output is not autistic or autistic. 

If the output is autistic then, in the second phase, the sub features of each of the core features 
are analyzed. Facial features along with the status of parents are also given importance. The most 
important characteristic of this expert system is the integration of facial features. The face image of the 
child under diagnosis is captured and modeled in 3D, or the image is captured using a 3D imaging 
system. In 3D imaging the geometric depth information has more importance. Facial features 
include the mouth, the eyes and the region between mouth and nose. A portion of autistic children have 
a wider mouth, open eyes and a large region between the mouth and nose, as shown in figure 4. Other 
common facial characteristics are an expressionless face, open mouthed appearance and prominent 
forehead region. Using 3D geodesic distance as the measure, the variance of the features 
from those of normally developing children is identified. Our aim is to extract the exact geometrical information from the face 
under scrutiny, compare it with a template and use this information for training. By using these 
facial attributes our focus is to study the contribution of each feature to the grading of autism. 


Our analysis points out that children below the age of 8 with other clinical features of autism 
mostly lack the facial features mentioned above, whereas children between the ages of 8 and 12 have shown the 
above mentioned facial features along with other clinical features of autism. Our expert system is 
designed in such a way that the weightage of facial features is varied by considering the age of the 
child under diagnosis. If the age is below 8, the weightage of the features in percentage is 75 (core 
features) : 15 (facial features) : 10 (parents status), whereas for ages from 8 to 12 the weightage of 
the features in percentage is 65 (core features) : 25 (facial features) : 10 (parents status). Parent status 
includes whether the parents are autistic or not; the age of the parents 
during conception is also given weightage. 
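
The age-dependent weightage can be sketched as follows. The percentages are the ones stated above. Two things are assumptions made only for this illustration: that each feature group has already been reduced to a sub-score between 0 and 1, and that the groups are combined as a weighted sum. The function names are hypothetical.

```python
def feature_weights(age_years):
    """Return the weightage of each feature group for a child's age.

    Below 8 years: 75 : 15 : 10; from 8 to 12 years: 65 : 25 : 10
    (core features : facial features : parents status).
    """
    if age_years < 8:
        return {"core": 0.75, "facial": 0.15, "parents": 0.10}
    return {"core": 0.65, "facial": 0.25, "parents": 0.10}


def combined_score(age_years, core, facial, parents):
    """Weighted sum of three sub-scores, each assumed to lie in [0, 1]."""
    w = feature_weights(age_years)
    return w["core"] * core + w["facial"] * facial + w["parents"] * parents


# The same sub-scores give facial features more influence for an older child.
print(combined_score(5, core=0.8, facial=0.6, parents=0.0))
print(combined_score(10, core=0.8, facial=0.6, parents=0.0))
```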


In the grading phase three sets of features, namely attributes from the core features, facial attributes and 
parents’ status, are considered. The weightage of the features varies depending on the age of the child. 
Initially we consider the two phases as two separate classification problems. In phase 1 the number of 
inputs is limited, so a Naive Bayes classifier is applied; it suits our problem and gives the result 
autistic or non autistic, as shown in figure 2. The inputs to the classifier are the core features: 
behavior problem, language disorder and general mental retardation. 


Figure 2: Diagnosis Mode (Core Features -> Naive Bayes Classifier -> Non Autistic / Autistic) 
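
A Naive Bayes classifier over the three binary core features can be sketched in a few lines. This is not the Weka model used in this work: the training rows below are made-up toy values, not clinical data, and add-one (Laplace) smoothing is an assumption of the sketch.

```python
import math


def train_naive_bayes(samples, labels):
    """Estimate class priors and P(feature = 1 | class) with add-one smoothing."""
    model = {}
    n_features = len(samples[0])
    for c in sorted(set(labels)):
        rows = [s for s, y in zip(samples, labels) if y == c]
        prior = len(rows) / len(samples)
        probs = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                 for j in range(n_features)]
        model[c] = (prior, probs)
    return model


def predict(model, x):
    """Return the class with the highest log-posterior for feature vector x."""
    best_class, best_score = None, -math.inf
    for c, (prior, probs) in model.items():
        score = math.log(prior)
        for value, p in zip(x, probs):
            score += math.log(p if value else 1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy rows: (behavior problem, language disorder, general mental retardation).
# These values are invented for illustration only.
samples = [(1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 0, 0), (0, 1, 0), (0, 0, 1)]
labels = ["autistic", "autistic", "autistic",
          "not autistic", "not autistic", "not autistic"]

model = train_naive_bayes(samples, labels)
print(predict(model, (1, 1, 1)), predict(model, (0, 0, 0)))
```

With so few inputs the independence assumption of Naive Bayes is cheap to apply, which is consistent with the choice described for the diagnosis mode.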

In the second phase more inputs belonging to different features are considered, which include attributes 
from the core features, facial region and parental status. A Naive Bayes classifier is applied and the result is 
analyzed, but there exists some fuzziness after a certain threshold. We need to integrate the two phases, 
so a neuro-fuzzy approach is applied. Soft computing approaches like neural networks and fuzzy logic 
can play a vital role in the design of such an expert system. Fuzzy logic is used to interpret expert 
knowledge directly using rules with a linguistic base. In this system we are qualitatively collecting a lot of 
information through structured discussion with the parent and from the clinician’s observation. 

A linguistic base can easily be framed into fuzzy rules, and neural networks are good at recognizing patterns, 
so this hybrid approach yields better performance. The output of the grading phase is as shown in 
table 2. 
To design the neuro-fuzzy system for diagnosis of autism we consider the attributes of core features, 
facial attributes and parental status. The hybrid architecture is as shown in figure 3. 

Figure 3: Neuro-Fuzzy Expert System Architecture for predicting Autism 

The knowledge base consists of twenty two fuzzy parameters. The neural network is trained to learn 
the parameters of the membership functions representing the linguistic terms in the rules. Sample fuzzy 
rules applied in the knowledge base are as follows: 

R1: If (Behavior Problem) && (Language Disorder) && (General Mental retardation) then 
belongs to class Autistic 

R2: If (Behavior problem Attributes (1 || 2 || ... n)) && (Language Disorder Attributes (1 || 2 
|| ... n)) && (Mental retardation Attributes (1 || 2 || ... n)) then belongs to class Autistic 
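
Rules R1 and R2 can be evaluated on membership degrees as sketched below. The source does not state which fuzzy operators are used, so the sketch assumes the common min/max operators for && and ||; the function names and the sample membership degrees are hypothetical.

```python
def fuzzy_and(*degrees):
    """Fuzzy AND, taken here as the minimum membership degree (assumption)."""
    return min(degrees)


def fuzzy_or(*degrees):
    """Fuzzy OR, taken here as the maximum membership degree (assumption)."""
    return max(degrees)


def rule_r1(behavior, language, retardation):
    """R1: all three core features must hold; returns the firing strength."""
    return fuzzy_and(behavior, language, retardation)


def rule_r2(behavior_attrs, language_attrs, retardation_attrs):
    """R2: at least one attribute of each core feature must hold."""
    return fuzzy_and(fuzzy_or(*behavior_attrs),
                     fuzzy_or(*language_attrs),
                     fuzzy_or(*retardation_attrs))


# Illustrative membership degrees in [0, 1].
print(rule_r1(0.9, 0.6, 0.7))
print(rule_r2([0.2, 0.8], [0.5], [0.1, 0.4]))
```

In a neuro-fuzzy system the membership degrees fed into such rules come from membership functions whose parameters are learned by the neural network, as described above.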

Different soft computing models, such as Naive Bayes, SVM, K-Means, FCM and Neuro-Fuzzy, 
have been tested with the same input attributes using the Weka tool. The performance is evaluated and the most 
outstanding results are shown in table 4. The operational procedure of the neuro-fuzzy system for 
autism classification is shown in figure 5. 

The expert system is tested and evaluated by the different stakeholders; the accuracy and evaluation 
survey summary is shown in figures 6 and 7. 


Technique      Sample size    Inputs    Outputs    Accuracy rate (%) 
Naive Bayes    47             12        2          100 
Neuro-Fuzzy    47             22        4          98.8 

Table 4: Performance of Classifier. 
Figure 4: Facial Features 

Figure 5: Operational Procedure of the Neuro-Fuzzy system for Autism classification 

Figure 7: Expert system evaluation Survey 

Conclusion: Studies related to the causes and symptoms of autism spectrum disorder are going on 
around the world. Information Technology is finding many applications in all fields, but due to the 
complexity and heterogeneous nature of this disorder, few works are reported which make use of IT 
in this area. Our expert system captures different inputs and produces an appropriate output. This 
system can be used by clinicians as a support system. The expert system was used and evaluated by 20 
potential users, and they all provided positive responses relating to the input, output and quality of the 
system. Integrating 3D facial features as input to the system adds a new dimension to autism research. 
Abstract 

Digital color images are among the most 
important types of data now 
circulating on the Internet, so the 
protection and security of image 
transmission is a top priority for 
computer experts. Many researchers have 
developed different techniques to 
increase the security of image 
transmission, and most of these 
techniques suffer from a slow 
encryption-decryption process. In this 
paper we present a classification of 
the most popular encryption-decryption 
techniques and suggest an efficient one; 
the suggestion is based on several 
factors such as speedup, throughput, 
encryption-decryption error and the 
hacking factor. 

Keywords: Encryption, decryption, 
speedup, throughput, hacking. 

1. Introduction 

Encryption is defined as the conversion 
of a plain message (here, the matrix which 
represents a digital color image) into a 
form called cipher text that cannot be 
read without decrypting it [15]. 
Decryption is the reverse 
process of encryption: it converts 
the encrypted text back 
into its original plain text, so that it can 
be read [15]. Color image encryption has 
to be done before transmitting the image, 
and it has to be done securely, so that 
no unauthorized user is able to decrypt 
the image sent over the network. Image 
encryption, video encryption and chaos 
based encryption have applications in 
many fields, including Internet 
communication, transmission, medical 
imaging, telemedicine and military 
communication. The evolution of 
encryption is moving towards a future of 
endless possibilities. Image data 
have special properties such as bulk 
capacity, high redundancy and high 
correlation among the pixels. Encryption 
techniques are very useful tools to 
protect secret information [3]. 

Encryption of data [16] has become an 
important way to protect data resources 
especially on the Internet, intranets and 
extranets. Encryption is the process of 
applying special mathematical 
algorithms and keys to transform digital 
data into cipher code before they are 
transmitted and decryption involves the 
application of mathematical algorithms 
and keys to get back the original data 
from cipher code. The main goal of 
security management is to provide 
authentication of users, integrity, 
accuracy and safety of data resources 
[16]. 

2. Related works 

Guodong Ye [9] presented an 
efficient image encryption algorithm 
using double logistic maps, in which the 
digital matrix of the image is confused 
by row and by column respectively. 
The confusion effect is carried out by the 
substitution stage, and Chen's system is 
employed to diffuse the gray value 
distribution. Haojiang Gao et al. [5] 
presented a 
Nonlinear Chaotic Algorithm (NCA) that 
uses power and tangent functions 
instead of a linear function. The 
encryption algorithm is a 
one-time-one-password system and is more secure than 
the DES algorithm. Jawahar Thakur et 
al. [17] presented a comparison between 
symmetric key algorithms such as DES, 
AES, and Blowfish. Parameters such 
as speed, block size, and key size are 
considered to evaluate the performance 
when different data loads are used. 
Blowfish has a better performance than 
the other encryption algorithms, and AES 
showed poor performance results 
compared to the other algorithms because it 
requires more processing power. 

Khaled Loukhaoukha et al. [9] 
introduced an image encryption 
algorithm based on Rubik’s cube 
principle. The original image is 
scrambled using the principle of Rubik’s 
cube and then XOR operator is applied 
to rows and columns of the scrambled 
image using two secret keys. Liu 
Hongjun et al. [18] designed a stream- 
cipher algorithm based on one-time keys 
and robust chaotic maps. The method 
uses a piecewise linear chaotic map as 
the generator of a pseudo-random key 
stream sequence. 

M. Zeghid et al. [19] analyzed the AES 
algorithm, and added a key stream 
generator (A5/1, W7) to AES to ensure 
improved encryption performance, 
mainly for images. The method 
overcomes the problem of textured zones 
existing in other known encryption 
algorithms. Maniccam et al. [20] 
presented a method for image and video 
encryption based on the SCAN 
methodology. 
The image encryption is performed by 
SCAN-based permutation of pixels and a 
substitution rule, which together form an 
iterated product cipher. The pixel 
rearrangement is done by scanning keys, 
and the pixel values are changed by a 
substitution mechanism. Figure 1 shows 
the basic SCAN patterns used in [16]. 
Mohammad Ali et al. [21] introduced a 
block-based transformation algorithm 
based on the combination of image 
transformation and the Blowfish 
algorithm. The algorithm resulted in the 
best performance, with the lowest 
correlation and the highest entropy. The 
characteristics of AES are its security 
and resistance against attacks, and the 
major characteristic of the RC4 algorithm is 
its speed [11]. A hybrid cipher 
combining the characteristics of AES 
and RC4 was developed; it achieved a 20% 
improvement in speed when 
compared to the original AES, and 
higher security compared to the original 
RC4 [13]. 

Rizvi et al. [12] analyzed the security 
issues of two symmetric cryptographic 
algorithms, Blowfish and CAST, 
and then compared their 
efficiency for encrypting text, image, 
and audio with the AES algorithm across 
different widely used operating 
systems. For text data, all algorithms run 
faster on Windows XP, but Blowfish is 
the most efficient and CAST runs slower 
than AES. Blowfish encrypts images 
most efficiently on all three 
platforms. For audio files, CAST 
performs better than Blowfish and AES 
on Windows XP, but on Windows Vista 
and Windows 7 there is no significant 
difference in the performance of CAST 
and AES; however, Blowfish encrypts 
audio files at a lower speed. 

Sanfu Wang et al. [21] presented an 
image scrambling method based on a 
folding transform, whose folding matrix 
is orthogonal and makes it possible to fold 
images either up-down or left-right. 
When an image is folded this way 
repeatedly, it becomes scrambled. The 
scrambling algorithm has an effective 
hiding ability with a small computation 
burden, as well as wide adaptability to 
images with different scales. 

Sathishkumar G.A et al. [14] presented an 
algorithm based on pixel shuffling and base 64 
encoding, which is a combination of 
block permutation, pixel permutation, 
and value transformation. The crypto 
system uses a simple chaotic map for 
key generation, and a logistic map is 
used to generate a pseudo random bit 
sequence. The total key length is 512 
bits for each round and the key space is 
approximately 2^512 for ten rounds. Shao 
Liping et al. [4] proposed a scrambling 
algorithm based on a random shuffling 
strategy, which can scramble non 
equilateral images and has a low cost to 
build the coordinate shifting path. The 
algorithm is based on permuting pixel 
coordinates and can be used to 
scramble or recover an image in real time. 
T. Sivakumar and R. Venkatesan [4] 
proposed a novel image encryption 
approach using matrix reordering; this 
approach was tested and 
compared with other techniques. 


Ziad A. Alqadi and others in [1] and [2] 
have presented a technique using direct 
and inverse conversions to convert a 
color image to a gray image and vice 
versa; this technique can be useful 
in color image encryption and 
decryption. 

3. Proposed methods 

3-1 First method (proposed 1): Using 
each of the components of the color 
image 

The encryption phase of this method can be 
implemented in the following steps: 

1. Get the original color image. 

2. Extract the red, green, and blue 
matrices from the original color 
image (each of them is a 2 
dimensional matrix). 

3. Reshape each matrix in step 2 to a 
square matrix. 

4. Generate one random square 
matrix for each component to be 
used as a private key. 

5. Encrypt each component by 
applying matrix multiplication of 
the matrix component and its 
private key. 

6. Reshape each encrypted matrix 
to its original size. 

7. Form the encrypted color image. 

The decryption phase can be
implemented by applying the following
steps:

1. Get the encrypted color image.

2. Extract the red, green, and blue
matrices from the encrypted color
image (each of them is a 2
dimensional matrix).

3. Reshape each matrix in step 2 to a
square matrix.

4. Use the private key generated for
each component.

5. Decrypt each component by
applying matrix multiplication of
the matrix component and the
inverse of its private key.

6. Reshape each decrypted matrix
to its original size.

7. Form the decrypted color image.

The following MATLAB code was written
to implement this method:

clear all
close all
% Read the color image (200 x 300 pixels here) and plot its histograms
a=imread('C:\Users\User\Desktop\flower-color-combinations.jpg');
subplot(2,2,1)
imshow(a), title 'Original image'
subplot(2,2,2)
imhist(a(:,:,1)), title 'Red component histogram'
subplot(2,2,3)
imhist(a(:,:,2)), title 'Green component histogram'
subplot(2,2,4)
imhist(a(:,:,3)), title 'Blue component histogram'
% Encryption phase
tic
b1=a(:,:,1);
b2=a(:,:,2);
b3=a(:,:,3);
% Flatten each component to a column of 60000 elements
b1=reshape(b1,200*300,1);
b2=reshape(b2,200*300,1);
b3=reshape(b3,200*300,1);
% Pad with zeros up to 245*245 = 60025 elements
for i=60001:60025
b1(i,1)=0;
b2(i,1)=0;
b3(i,1)=0;
end
c1=reshape(b1,245,245);
c2=reshape(b2,245,245);
c3=reshape(b3,245,245);
% One random private key per component
k1=rand(245,245);
k2=rand(245,245);
k3=rand(245,245);
c1=double(c1);
c2=double(c2);
c3=double(c3);
e1=c1*k1;
e2=c2*k2;
e3=c3*k3;
toc
% Decryption phase
tic
d1=e1*inv(k1);
d2=e2*inv(k2);
d3=e3*inv(k3);
d11=reshape(d1,245*245,1);
d12=reshape(d2,245*245,1);
d13=reshape(d3,245*245,1);
% Drop the padded elements
for i=1:60000
d21(i,1)=d11(i,1);
d22(i,1)=d12(i,1);
d23(i,1)=d13(i,1);
end
d31=uint8(d21);
d32=uint8(d22);
d33=uint8(d23);
d41=reshape(d31,200,300);
d42=reshape(d32,200,300);
d43=reshape(d33,200,300);
d4(:,:,1)=d41;
d4(:,:,2)=d42;
d4(:,:,3)=d43;
toc
figure
subplot(2,2,1)
imshow(d4), title 'Decrypted image'
subplot(2,2,2)
imhist(d4(:,:,1)), title 'Decrypted red component histogram'
subplot(2,2,3)
imhist(d4(:,:,2)), title 'Decrypted green component histogram'
subplot(2,2,4)
imhist(d4(:,:,3)), title 'Decrypted blue component histogram'
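The constants 245 and 60025 in the listing follow from rounding the number of values up to the next perfect square. As a minimal sketch of that arithmetic, written in Python rather than MATLAB, the hypothetical helper `square_padding` below (a name introduced here, not taken from the paper) returns the side of the square matrix and the number of zeros to pad:

```python
import math

def square_padding(num_values):
    """Return (side, pad) such that num_values + pad == side * side."""
    side = math.isqrt(num_values)
    if side * side < num_values:
        side += 1  # round up to the next perfect square
    return side, side * side - num_values

# One 200 x 300 color component, as in the listing above.
print(square_padding(200 * 300))
# All three components of a 200 x 300 image taken together.
print(square_padding(200 * 300 * 3))
```

For an image whose value count is already a perfect square, the helper returns a padding of zero, which corresponds to skipping the padding step.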
3-2 Second method (proposed 2):
Converting color image to 2
dimensional matrix

The encryption phase here consists
of the following steps:

1. Get the original digital color
image as a 3 dimensional
matrix (m).

2. Reshape m into a 1 column
matrix (r).

3. Get the size of r (s).

4. If s is a square number, proceed to
step 6.

5. Find the nearest square number
greater than s and adjust s to this
number; adjust r by padding zeros.

6. Reshape r to a square matrix (r1).

7. Generate a random square matrix of
doubles with the same size as r1;
this matrix will be used as a
private key for
encryption-decryption (k).

8. Save k to be used in the
decryption phase.

9. Get the encrypted image (e) by
applying matrix multiplication of
r1 and k.

10. Reshape e into a 1 column matrix
(e1).

11. Omit the padded zeros from e1.

12. Reshape e1 into a 3 dimensional
matrix to get the encrypted color
image.


The decryption phase can be
implemented by applying the following
steps:

1. Get the encrypted digital color
image as a 3 dimensional
matrix (en1).

2. Reshape en1 into a 1 column
matrix (en2).

3. Get the size of en2 (s).

4. If s is a square number, proceed to
step 6.

5. Find the nearest square number
greater than s and adjust s to this
number; adjust en2 by padding zeros.

6. Reshape en2 to a square matrix
(en3).

7. Use the private key k.

8. Get the decrypted image (di) by
applying matrix multiplication of
en3 and the inverse of k.

9. Reshape di into a 1 column matrix
(di1).

10. Omit the padded zeros from di1.

11. Reshape di1 into a 3 dimensional
matrix to get the decrypted
original color image.

The following MATLAB code was written
to implement this method:

clear all
close all
% Read the color image (200 x 300 pixels here) and plot its histograms
a=imread('C:\Users\User\Desktop\flower-color-combinations.jpg');
subplot(2,2,1)
imshow(a), title 'Original image'
subplot(2,2,2)
imhist(a(:,:,1)), title 'Red component histogram'
subplot(2,2,3)
imhist(a(:,:,2)), title 'Green component histogram'
subplot(2,2,4)
imhist(a(:,:,3)), title 'Blue component histogram'
% Encryption phase
tic
% Flatten the whole image to a column of 180000 elements
b=reshape(a,200*300*3,1);
% Pad with zeros up to 425*425 = 180625 elements
for i=180001:180625
b(i,1)=0;
end
c=reshape(b,425,425);
% Random private key
k=rand(425,425);
c=double(c);
e=c*k;
toc
% Decryption phase
tic
d=e*inv(k);
d1=reshape(d,425*425,1);
% Drop the padded elements
for i=1:180000
d2(i,1)=d1(i,1);
end
d3=uint8(d2);
d4=reshape(d3,200,300,3);
toc
figure
subplot(2,2,1)
imshow(d4), title 'Decrypted image'
subplot(2,2,2)
imhist(d4(:,:,1)), title 'Decrypted red component histogram'
subplot(2,2,3)
imhist(d4(:,:,2)), title 'Decrypted green component histogram'
subplot(2,2,4)
imhist(d4(:,:,3)), title 'Decrypted blue component histogram'
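The pad, multiply, invert, and trim round trip can also be sketched in standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: the image is a tiny hypothetical 2 x 3 RGB image, values are flattened in row-major order (MATLAB's reshape is column-major), the cipher keeps the full padded square as the MATLAB listing does, and `make_key` adds the matrix side to the diagonal of the random key so that the key is guaranteed to be invertible, a safeguard that the paper does not describe.

```python
import random

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def invert(mat):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(mat)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(mat)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def make_key(side, rng):
    # Assumption (not in the paper): adding `side` to the diagonal makes
    # the key strictly diagonally dominant and therefore invertible.
    return [[rng.random() + (side if i == j else 0.0) for j in range(side)]
            for i in range(side)]

def to_square(values, side):
    """Pad with zeros and reshape (row-major) into a side x side matrix."""
    padded = list(values) + [0] * (side * side - len(values))
    return [[float(v) for v in padded[i * side:(i + 1) * side]]
            for i in range(side)]

def encrypt(values, side, key):
    return matmul(to_square(values, side), key)

def decrypt(cipher, key, count):
    plain = matmul(cipher, invert(key))
    flat = [v for row in plain for v in row]
    # Round back to 8-bit values and drop the padded tail.
    return [min(255, max(0, round(v))) for v in flat[:count]]

rng = random.Random(0)
pixels = [rng.randrange(256) for _ in range(2 * 3 * 3)]  # hypothetical image
side = 5  # 18 values padded to 5 * 5 = 25
key = make_key(side, rng)
cipher = encrypt(pixels, side, key)
recovered = decrypt(cipher, key, len(pixels))
print(recovered == pixels)
```

Rounding before the comparison mirrors the `uint8` conversion in the MATLAB listing, which absorbs the small floating-point error introduced by the inversion.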

3-3 Third method (proposed 3):
Converting color image to gray
image

This method can be implemented like the
first method, except that the color image is
first converted to a gray image using the
direct conversion proposed by the author in [1];
the gray image is then encrypted as
in method 1. After that, the encrypted
gray image can be decrypted and
converted back to a color image using the
inverse conversion mentioned in [1].

4. Experimental results 

The proposed methods were
implemented several times using
different color images of different
sizes, and the results always gave a
correlation coefficient equal to 1 between
the original image and the decrypted
one, which means that the methods
recover the image without any
loss of information. Figures 1 and 2
show the original image and the
decrypted one with the histogram of
each component of the color image.
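The correlation check can be illustrated with a short Python sketch. It assumes the Pearson correlation coefficient over flattened pixel values, which the text does not state explicitly, and the pixel values are hypothetical:

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

original = [12, 200, 37, 255, 0, 90]  # hypothetical flattened pixel values
decrypted = list(original)            # a lossless round trip
print(correlation(original, decrypted))
```

A coefficient of 1 indicates a perfect positive linear relationship, so an element-wise comparison such as `original == decrypted` is a stricter test of lossless recovery.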

The proposed method is also very secure,
and it is infeasible to recover the image
without the key, because the private key
has the following features:

• The private key is a 2 dimensional
matrix of very large size.

• Each element of the private key
is a random double-precision number,
which makes it infeasible to guess.
Figure 1: Sample of the original color
image



Figure 2: Decrypted color image. 


The encryption and decryption times
were calculated and compared with the other
methods mentioned in the related works;
these results are listed in Table 1.


Table 1: Comparison results

[Table body not recoverable from the source. Surviving header fragments: Method, Direct conversion, (s), Decryption.]



The speedup was calculated by dividing
the total time of each method by the total
time of proposed 1 (which was taken as
the reference because it has the best
efficiency).

The throughput was calculated by
dividing the color image size by the total
time.
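As a small worked example of these two definitions, the Python sketch below uses hypothetical timings (not values from Table 1) and assumes the image size is measured in bytes, a unit the text does not specify:

```python
def speedup(method_total_time, reference_total_time):
    """Total time of a method divided by the total time of the reference."""
    return method_total_time / reference_total_time

def throughput(image_size, total_time):
    """Image size divided by total (encryption plus decryption) time."""
    return image_size / total_time

image_size = 200 * 300 * 3  # bytes in a 200 x 300 RGB image (assumed unit)
reference_time = 0.25       # hypothetical total time of proposed 1, seconds
other_time = 0.5            # hypothetical total time of another method

print(speedup(other_time, reference_time))
print(throughput(image_size, reference_time))
```

With this definition the reference method has a speedup of 1, and larger values mean that a method is slower than the reference by that factor.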

For clarity, the data in Table 1 are
represented in Figures 3 and 4.


From the above results we can see that
proposed method 1 has the best
efficiency.

Conclusions 


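[Figure residue: a chart with horizontal-axis ticks 1 to 10 and the values 21.2543, 8.5343, 10.1018, 20.7776, 3.4133, 1.5728, 6.5536, 1.4043, 0.7786, 1.3661; the figure and its caption did not survive extraction.]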



Methods of encryption-decryption
of color images were proposed and a
comparative analysis was done. It was
shown that the proposed 1 method has the
best performance because it is
characterized by the following features:

• Best speed in the encryption phase.

• Best speed in the decryption phase.

• Best throughput.

• No damage of information.

• Impossible to hack.
Abstract — The technological era we live in has influenced our
daily activities at home, work and everywhere. The software 
technology has prospered, and new technology is being introduced 
to end-users every day. Document image analysis software has 
been influenced with more applications being developed. Hence, 
this inspired us to perform a general study on application software 
with emphasis on document analysis software i.e., Optical 
Character Recognition (OCR). This study starts with a brief 
review on recent works on document processing and Arabic OCR. 
Following this, a questionnaire survey is conducted to investigate 
the capability and familiarity of individuals with the top four 
Arabic OCR software, in order to provide recommendations and 
future research directions in this area. The results show that OCR 
is an essential technology which should be available with operating 
system tools. From the survey results, many of the respondents are 
not familiar with the top four commercial OCR, even those who 
are in the Information Technology (IT) sector. The study 
concludes that the available commercial OCR software are 
becoming more efficient; however, the accuracy rate reported 
needs to be further evaluated to provide a more accurate 
performance/recognition rate and, further investigation is needed 
to analyze the reported commercial accuracy rates. Finally, the 
study concludes with recommendations and future research 
directions. 

Index Terms — Character recognition, Document Analysis, 
Editing Software, Optical Character Recognition Software, 
Scanner. 


I. Introduction 

Document processing is an area of research that includes
pattern recognition, artificial intelligence, data mining, 
information retrieval, image processing, and computer vision. 
Document processing, i.e. OCR, has been used in coordinating 
and conducting business transactions, mail sorting, check 
processing, passport processing, online publishing, digital 
libraries, etc. Advancements in technology have
revolutionized document processing, with the increase in speed,
the emergence of new storage media, and increased storage
capacity. Document processing is the capture of information
from a paper medium into a digital medium, in other words, 
digitization. This may involve processing documents 
containing text only or documents containing mixed content 
(text and images). The process starts by separating/segmenting 
the documents into text and images which are then processed 
separately. Nowadays, with the available advanced technology, 
documents can be processed automatically with high precision 
results. Text processing involves the conversion of paper or 
image documents into digital formats, i.e., OCR, which can be 
further processed. OCR is a common method of digitizing 


printed texts, to enable searching, editing, storing, transmitting 
or further processing of documents [1]. Similarly, image 
processing involves editing, storing, transmitting and further 
processing of images. 

The term digitization has emerged over the last two decades.
It is the conversion of paper documents into electronic formats,
and nowadays the terms e-government, e-library, e-services,
e-learning, e-commerce, etc. have emerged, meaning that
digitization is being applied in many sectors/areas. Hence, a
reduction in the amount of paper used is seen in many areas.
Many libraries are reducing the paper books on their shelves
and converting to digital libraries. With the vast
amount of information to be digitized, many governments, firms
and libraries around the world have started projects to digitize
their paper contents, especially ancient manuscripts, theses,
books, and old documents. As a result, digital libraries are now
available online, which makes it easier to find information at the
click of a button [2].

Computers are becoming faster and more reliable in
document processing; in addition, database technologies for
maintaining archived information have existed for a long time
and have been used to store vast amounts of information. Thus, OCR
intelligence can be applied to digital image documents, which 
means the reduction in cost and time, since searching for 
information on a computer is much faster than finding it in a 
pile of dusty documents, which may take from several minutes 
to several hours. Therefore, the purpose of digitization is to 
produce digital documents which can be edited, stored, 
searched, transmitted online and used in other applications such 
as machine translation, text mining, and text-to-speech 
conversion. 

This study explores the user experience with application 
software, document processing software and the four top 
available Arabic OCR software. The purpose is to get the user’s 
perspective on document processing software. The main 
objectives of this study are: 

- To explore the level of competence of individuals with basic 
application software skills in creating and sharing documents. 

- To investigate how familiar and capable individuals are with 
OCR software. 

- To investigate the accuracy of the recognition rate of the top 
four commercial OCR software from the user perspective. 

- To study the current status of research in the area of Arabic 
OCR. 

This paper is organized as follows. After this introduction, 
Section 2 provides the literature survey. Section 3 explains the 
research methodology. Section 4 presents the results and 
analysis. Section 5 provides the limitations and discussion, and 
finally the conclusion and future work are stated in Section 6.

II. Literature review 

The efficiency of document analysis has progressed and 
advanced with technology. Applications which use OCR have 
increased with the emergence of more efficient OCR software. 
The early versions of OCR software were very limited and 
worked for specific applications on specific fonts at a time. 
Early OCR may be traced to technologies involving telegraphy 
and creating reading devices for the blind [3], such as the 
machine developed by Emmanuel Goldberg in 1912 which 
converts characters into standard telegraph code and the 
Optophone device developed by Fournier D’Albe in 1914. 
Developments in OCR continued from the early 20th century
until the era of the first digital computer in the early 1940s. 
Then, in the 1950s OCR became part of the business world [4]. 
Nowadays, advanced systems can produce a high degree of 
recognition accuracy for most fonts and different file formats, 
providing results which are similar to the format and layout of 
the input documents including columns, images, fonts, styles ... 
etc. Currently, OCR is available online as a service in a
cloud-computing environment, i.e., through APIs, which can be used
from any device connected to the Internet. These OCR APIs provide a
simple way of parsing images and even multi-page documents 
to provide the results as a text file. In addition, various 
commercial and open source OCR systems are available for 
most of the common languages [5-6]. 

The work on document analysis is very active. Recently,
research has shifted from segmenting simple documents
with text and images to segmenting text from scenes,
billboards, movies, etc. Document analysis starts with a
digital document image, to which preprocessing is applied
depending on the quality of the image, and ends with the
characters being recognized. Preprocessing
may include filtering noise, de-skewing the image, converting it
to a gray or binary image, etc. Next, page segmentation techniques are
applied to identify the different regions on the page; this is shown
in Figure 1. After this, the text and graphics/image regions are
processed separately. Processing of the text portion of a
document image includes script recognition, font recognition,
line/paragraph segmentation and word segmentation (Figure 2).
Further processing of word segmentation includes Holistic
Word Recognition, Integrated Segmentation and Recognition,
and Character Segmentation (OCR) (Figure 3).
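Two of the preprocessing operations named above, conversion to a gray image and conversion to a binary image, can be sketched in a few lines of Python. The sketch assumes the widely used ITU-R BT.601 luma weights and a fixed threshold of 128; neither choice is taken from the paper, and the 2 x 2 image is hypothetical:

```python
def to_gray(rgb_rows):
    """Convert rows of (r, g, b) tuples to gray levels (BT.601 weights)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_rows]

def binarize(gray_rows, threshold=128):
    """Map gray levels to 1 (at or above threshold) or 0 (below)."""
    return [[1 if v >= threshold else 0 for v in row] for row in gray_rows]

# Hypothetical 2 x 2 image: white, black, dark red, near-black.
image = [[(255, 255, 255), (0, 0, 0)],
         [(200, 30, 30), (20, 20, 20)]]
gray = to_gray(image)
binary = binarize(gray)
print(gray)
print(binary)
```

Practical systems often choose the threshold adaptively from the image histogram instead of fixing it in advance.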


[Figure 1 flow: Input Document Image -> Preprocessing -> Processed Image -> Layout Analysis (Page Segmentation) -> Text (Figure 2) and Images (Photos/Graphics/Signatures)]


Figure 1: Document Image Analysis (Page Segmentation) 

A general OCR system, Figure 3, goes through the following 
stages: preprocessing, character segmentation, feature 
extraction, classification and recognition. After a digital image 


is fed into the system, preprocessing techniques may be used to 
remove noise, de-skew image, sharpen images ... etc. 


[Figure 2 content: Text Region: Line/paragraph segmentation, Script Recognition, Font Recognition, Word Segmentation. Word Segmentation: Holistic Word Recognition, Integrated Segmentation and Recognition, Character Segmentation (OCR) (Figure 3)]


Figure 2: Text Region Processing 


Next, the characters are isolated or segmented, and
normalized. In addition, other techniques may be used to help
prepare the characters for the feature extraction stage. The next
stage is to classify and recognize the characters. Finally,
post-processing (contextual or grammatical information, or data
dictionary authentication) may be used to aid in achieving
higher recognition rates. All these stages work in a pipeline
fashion: each stage feeds into the next, so
an efficient OCR system depends on the success of every stage.
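The pipeline structure can be illustrated with a toy Python sketch in which each stage is a function whose output feeds the next. Everything here is hypothetical and chosen for brevity: the 3 x 3 glyph templates, the fixed-width segmentation, and the nearest-template classifier are stand-ins, not the techniques of any particular OCR system.

```python
# Hypothetical 3 x 3 glyph templates; '1' marks ink, '0' marks background.
TEMPLATES = {
    "I": "010" "010" "010",
    "L": "100" "100" "111",
    "T": "111" "010" "010",
}

def preprocess(rows):
    """Binarize: blanks and dots become '0', any other mark becomes '1'."""
    return ["".join("0" if ch in " ." else "1" for ch in row) for row in rows]

def segment(rows, glyph_width=3, gap=1):
    """Cut the line into fixed-width glyphs (assumes monospaced input)."""
    step = glyph_width + gap
    return [[row[s:s + glyph_width] for row in rows]
            for s in range(0, len(rows[0]), step)]

def extract_features(glyphs):
    """Use the flattened bitmap of each glyph as its feature string."""
    return ["".join(glyph) for glyph in glyphs]

def classify(features):
    """Pick the template with the smallest Hamming distance."""
    def distance(a, b):
        return sum(x != y for x, y in zip(a, b))
    return [min(TEMPLATES, key=lambda name: distance(TEMPLATES[name], f))
            for f in features]

def postprocess(chars):
    """Join recognized characters; a dictionary check could go here."""
    return "".join(chars)

STAGES = [preprocess, segment, extract_features, classify, postprocess]

def run_pipeline(page, stages=STAGES):
    data = page
    for stage in stages:  # the output of each stage feeds the next one
        data = stage(data)
    return data

PAGE = [
    "### .#. #..",
    ".#. .#. #..",
    ".#. .#. ###",
]
print(run_pipeline(PAGE))
```

Because every stage consumes the previous stage's output, an error introduced early, for example a wrong cut in `segment`, propagates to all later stages.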



Figure 3: General OCR System 


The work on OCR is still active in all stages presented in 
Figure 3. New techniques for OCR preprocessing, feature 
extraction, classification, and recognition are being published 
in recent literature. The work in [7] presents a comprehensive
survey on character segmentation and the challenges of
segmenting Arabic script based languages, i.e., Arabic,
Urdu, Persian, Pashto, Sindhi and Malay (Jawi). However, this
research concentrated more on Urdu and concluded that the
cursive nature of the Arabic script is the main challenge in
character segmentation, especially in the Nasta'liq writing style
compared to Naskh. Some other works related to document
analysis include analysis of documents with non-uniform 
background [8], page segmentation [9], Arabic script 
recognition [10] [11], and segmentation of Arabic characters 
[12]. In addition, recent surveys in the areas of printed and 
handwritten Arabic and Urdu OCR research are available in the 
literature [13-17]. 

The research work on OCR for cursive scripts, such as
Arabic, is not as mature as that for Latin scripts.
Therefore, document image databases are needed for training,
testing, validating and eliminating errors. The work in [18]
presents a multilingual image database created from various 
texts containing multiple fonts collected from various sources 
and 84 language scripts including Arabic script. The Arabic 
script is adopted by languages such as Kashmiri, Kurdish, 
Pashto, Persian, Punjabi (Shahmukhi), Sindhi, Urdu, and 
Uyghur. The images in the database were converted into word, 
single line, multiline and paragraph images. Therefore, this
database is a step towards the
establishment of a standardized image database for the OCR of
the world's language scripts. This database can be expanded to
include more images and scripts, which researchers in the area
of document analysis can benefit from [18]. Other works which
discussed databases cover the following scripts: Sindhi [19],
handwritten Persian [20], Pashto [21], handwritten Urdu [22],
and Arabic [23].

Recent OCR works on printed Arabic scripts cover
Urdu [24], Pashto
[25], Persian [26], Punjabi [27], Uyghur [28], and Sindhi [29];
those on handwritten Arabic scripts cover Urdu [30], Pashto
[31], Persian [32], Punjabi [33], Uyghur [34], and Sindhi [35].

The research in the area of printed Arabic OCR resulted in a 
number of high quality commercial recognition systems. 
However, the general problems of such systems have not been 
solved yet. The main challenges, which are not solved in many 
OCR systems, still linger: multi-language/script recognition, 
multi-column reading order, very noisy documents, layout 
retention, and handwriting script [36]. The authors in [36] 
developed a multi-stage approach to document analysis 
involving preprocessing, content segmentation, recognition, 
and correction. 

Currently, commercial OCR software for printed Latin
scripts has reported 100% accuracy rates; however, this rate is
not possible without the use of contextual or grammatical
information or data dictionary authentication, even where clear
imaging is available. For example, using a smaller dictionary
can help achieve a higher recognition rate for reading the
amount line of a cheque, because recognizing words
from a dictionary is easier than trying to parse individual
characters from script [37]. One study based on recognition of
newspaper pages from the 19th and early 20th centuries concluded
that character-by-character OCR accuracy for commercial OCR
software varied from 81% to 99% [38]. On the other hand, the
accuracy rate for hand-printed text is 80% to 90% on neat,
clean hand-printed characters; this accuracy rate still
translates to dozens of errors per page, making the technology
useful only in very limited applications. Furthermore, the
accuracy rate for cursive text is still low, especially in scripts
other than Latin, where more research is needed to achieve
higher rates. Recognition of hand-printing, cursive
handwriting, and printed text in other scripts is still the subject
of active research [37 - 39].
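To make the link between an accuracy rate and the number of errors concrete, the Python sketch below computes a character-level accuracy from the edit distance between a ground-truth string and an OCR output. This is one common definition and is an assumption here, since the cited studies may measure accuracy differently; the sample strings and the page length of 500 characters are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def character_accuracy(truth, ocr_output):
    """Share of ground-truth characters recovered without an edit."""
    return 1.0 - edit_distance(truth, ocr_output) / len(truth)

def expected_errors(characters_per_page, accuracy):
    """Approximate number of character errors on one page."""
    return round(characters_per_page * (1.0 - accuracy))

truth = "optical character recognition"
ocr_output = "optica1 charactor recognition"  # two substituted characters
print(edit_distance(truth, ocr_output))
print(character_accuracy(truth, ocr_output))
print(expected_errors(500, 0.90))
```

Under these assumptions, a hand-printed page of 500 characters read at 90% accuracy contains about 50 character errors, which matches the "dozens of errors per page" noted above.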

OCR used to be part of scanner software; then, it became
available for use on image documents without the need for a
scanner. Now, OCR software is available as mobile apps, where
documents are processed by taking an image with the
mobile camera. Also, some OCR companies provide OCR
as an online service, where the documents are processed
on the cloud, i.e., through an OCR API [6]. Therefore, OCR applications
run in real time and are used in government offices for scanning
passport applications, license plate identification, etc.
In this study, the top four
Arabic commercial OCR software packages are studied in order to
provide recommendations and future directions in the area of
OCR; these are Adobe Acrobat, OmniPage Standard,
ABBYY FineReader, and Readiris. Table 1 shows the reported OCR
performance rates and the number of languages supported by these
packages. The data in Table 1 were obtained from
www.toptenreviews.com [40].


Table 1: Reported OCR performance rates and number of languages supported

Software | Usability | Text Accuracy | Languages Recognized
Adobe Acrobat | 91% | 100% | 190
OmniPage Standard | 83% | 99.80% | 120
ABBYY FineReader | 66% | 99.84% | 190
Readiris | 75% | 99.83% | 130

*source: http://www.toptenreviews.com/business/software/best-ocr-software/ [40]


Most of these software packages report perfect or close to perfect
accuracy rates; however, no specific accuracy rates are
provided for each of the languages supported.
Therefore, further studies are needed to evaluate the recognition
rates for each language. As can be seen, the highest number
of languages supported is 190, by two packages, namely Adobe
Acrobat and ABBYY FineReader.

A simple search for the term “OCR” on www.download.com
returned 503 hits as of the 24th of April 2017, covering both open source
and commercial software [41]. Most of these are open
source and are usually research projects implemented by
individuals or research groups. In general, open source
OCR software usually performs poorly, has few editing
capabilities, and is far from reaching the performance levels of
commercial OCR software [41].

III. Research methodology

A semi-structured questionnaire survey was designed. The
survey aims to investigate the objectives outlined in the
introduction. The questionnaire was first prepared and then
disseminated to five subjects. After that, the questionnaire was
modified. The survey questions are given in Table 2. Then
three semi-structured interviews with faculty members from the
IT sector were conducted to discuss the survey questions. The
purpose of this step was to get feedback and comments on the
clarity of the survey and to consolidate some of the findings
observed from the survey. Then, after an in-depth
review of the subject matter, the semi-structured questionnaire
was designed using www.monkeysurvey as a tool for data
collection.

The survey was disseminated to the author’s contacts by
email, WhatsApp, and ResearchGate. These
individuals mainly work in higher-level academic
institutions all over the world. Approximately 800 emails were
sent, and the response rate was expected to be between 10%
and 20%. This study explores the user experience with
application software, document processing software and the
four top Arabic OCR software in order to investigate the user’s
perspective on document analysis software.
Table 2: The survey questions with no. of responses to each question

No. | Question Statement | Type of Question | Answered | Skipped | Percentage Answered
1 | In what country do you live?* | Dropdown | 124 | 2 | 98.4%
2 | What is your gender? | Multiple Choice | 124 | 2 | 98.4%
3 | What is your age? | Dropdown | 126 | 0 | 100.0%
4 | What is your area of expertise/study?* | Dropdown | 125 | 1 | 99.2%
5 | What is the highest level of education you have completed? | Dropdown | 125 | 1 | 99.2%
6 | How important to your learning is it to have access to technology? | Multiple Choice | 125 | 1 | 99.2%
7 | How capable are you with each of the following application software to create and share documents? | Matrix/rating scale | 125 | 1 | 99.2%
8 | How often do you use the following application software? | Matrix/rating scale | 125 | 1 | 99.2%
9 | Which of the following problems you have encountered with the application software in Question 8? * | Multiple Choice | 106 | 20 | 84.1%
10 | Have you used a scanner or have you taken a photo/image of a text document (which may also contain images/photos)? | Yes/No | 118 | 8 | 94.9%
11 | Have you used an OCR software to convert paper documents into editable text files? | Yes/No | 120 | 7 | 95.2%
12 | For which of the following languages have you used OCR software? * | Multiple Choice | 102 | 24 | 81.0%
13 | How often have you used the following OCR software? | Matrix/rating scale | 114 | 12 | 90.5%
14 | How capable are you with each of the following OCR Software? | Matrix/rating scale | 108 | 18 | 85.7%
15 | Which of the following problems have you encountered with the OCR Software in Question 14? * | Multiple Choice | 90 | 36 | 71.4%
16 | Do you know that the following features are available with many OCR Software? | Matrix/rating scale | 110 | 16 | 87.3%
17 | How do you rank the accuracy of OCR in the following language (If applicable)?* | Matrix/rating scale | 110 | 16 | 87.3%
18 | Have you used OCR software to convert scanned handwritten text documents into an editable text document? | Yes/No | 111 | 15 | 88.1%
19 | If your answer to Question 18 is yes, How was the accuracy? And in which language? * | Matrix/rating scale | 69 | 57 | 54.8%
20 | Please select the statement that indicates how you feel about OCR? * | Multiple Choice | 106 | 20 | 84.1%
21 | Comments/Suggestions/concern or anything you want to add to the survey. | Comment box | 22 | 104 | 17.5%


* questions which include an “other” option 


The survey questionnaire starts by collecting information on
the respondent’s profile, such as gender, age, country of
residence, education level, and area of expertise. The second
part of the survey was divided into two parts: application
software technologies and OCR software. The question types
were mainly multiple choice, yes/no, dropdown,
matrix/rating scale, and comment box. In addition, some
questions had an “other” option, giving the participants the
flexibility to provide their own answers instead of choosing
from the provided ones. The survey was conducted between
January 18, 2017 and February 28, 2017. The questions used
in the survey are provided in Table 2. The total number of
responses was 126, approximately 15.8% of the number of
emails sent; this is an acceptable response rate and is
in line with the projected expectations. During the survey
period, the author sent several reminders by email and
WhatsApp to the contacts; however, the reminders had little
effect, and only very few
responses were added after
each reminder. The low response rate could be because most
of the contacts are Doctorate and Master level individuals, who
are usually too busy to spare 10-15 minutes to fill in the
survey.
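The response rate and the "Percentage Answered" column of Table 2 follow from simple ratios. The Python sketch below reproduces two of them from counts reported in the text and in Table 2; the helper name `percent` is introduced here for illustration only.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

emails_sent = 800  # approximate number of emails reported in the text
responses = 126    # total number of responses

# Overall response rate.
print(f"{percent(responses, emails_sent):.2f}%")
# Share of respondents who answered Question 1 (124 of 126).
print(f"{percent(124, responses):.1f}%")
```

The first ratio is 15.75%, which the text reports as approximately 15.8%.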

IV. RESULTS AND ANALYSIS 

The survey is divided into three parts: the participants' personal
profile information, such as gender, age, country of residence,
education level and area of expertise; application software
technologies; and OCR software. In this section, the
questionnaire results are presented and analyzed.

A. Personal Profile Information

The total number of responses was 126 (69.4% males and
30.6% females), with two respondents who did not indicate
their gender. The respondents were from 25 countries
covering five continents (Table 3). The ages of respondents
are given in Table 4. It is observed that approximately 64% of
the respondents are between the ages of 31 and 50, and
approximately 16% are 30 or younger. This suggests that the
individuals who use such software are usually employed and educated.
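Table 3: The number of respondents across the continents

No. | Country | Percent Responses | Response Count | Continent | Continent Percent Response
1 | Algeria | 4.0% | 5 | Africa | 20.2%
2 | Egypt | 9.7% | 12 | Africa | 20.2%
3 | Ethiopia | 0.8% | 1 | Africa | 20.2%
4 | Ghana | 0.8% | 1 | Africa | 20.2%
5 | Morocco | 0.8% | 1 | Africa | 20.2%
6 | Nigeria | 0.8% | 1 | Africa | 20.2%
7 | Sudan | 2.4% | 3 | Africa | 20.2%
8 | Tunisia | 0.8% | 1 | Africa | 20.2%
9 | Bahrain | 0.8% | 1 | Asia | 65.3%
10 | India | 4.8% | 6 | Asia | 65.3%
11 | Japan | 0.8% | 1 | Asia | 65.3%
12 | Jordan | 2.4% | 3 | Asia | 65.3%
13 | Malaysia | 8.1% | 10 | Asia | 65.3%
14 | Oman | 0.8% | 1 | Asia | 65.3%
15 | Pakistan | 3.2% | 4 | Asia | 65.3%
16 | Palestine | 0.8% | 1 | Asia | 65.3%
17 | Qatar | 0.8% | 1 | Asia | 65.3%
18 | Saudi Arabia | 41.1% | 51 | Asia | 65.3%
19 | Yemen | 1.6% | 2 | Asia | 65.3%
20 | Sweden | 0.8% | 1 | Europe | 4.8%
21 | Ukraine | 0.8% | 1 | Europe | 4.8%
22 | United Kingdom of Great Britain and Northern Ireland | 3.2% | 4 | Europe | 4.8%
23 | Canada | 2.4% | 3 | N. America | 8.1%
24 | United States of America | 5.6% | 7 | N. America | 8.1%
25 | Brazil | 1.6% | 2 | S. America | 1.60%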


*The highlighted countries are from the Middle East. 


The responses to Question 4, “What is your area of
expertise/study?”, are given in Table 5. The answer option
“Other” received three responses: Human Resources,
Information Science and Knowledge Management, and
Information and Communications. These three responses
can be added under the category Computer Science/Computer
Engineering / IT, increasing its share of responses from 57.6%
to 60%. The education/teaching category received 14.4%, followed
by the sciences, which received 7.2%. This shows that the main
categories of respondents were from the IT, engineering and
sciences sectors, with very low responses from other fields. The
main reason could be that the authors’ contacts were mainly
from IT and engineering; however, the survey clearly
stated that people from all areas of expertise/study were
invited to fill in the survey, and contacts were urged to forward
the survey link to their colleagues and friends.
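
The 60% figure can be re-derived from the response counts in Table 5. The following is a minimal sketch, not part of the original study; the variable names and the `percent` helper are illustrative assumptions, while the counts are copied from Table 5.

```python
# Response counts copied from Table 5 (Question 4).
counts = {
    "Business and Economics": 4,
    "Computer Science/Computer Engineering / IT": 72,
    "Education/Teaching": 18,
    "Engineering": 4,
    "Government, Law, Politics and Policy": 1,
    "Language and Linguistics": 5,
    "Medicine, Nursing and Health Sciences": 7,
    "Philosophy, Ethics and Theology": 1,
    "Sciences (Biology, Chemistry, Mathematics, Physics.)": 9,
    "Visual and Performing Arts and Film Studies": 1,
    "Other (please specify)": 3,
}

total = sum(counts.values())  # number of answers to Question 4


def percent(n: int, total: int) -> float:
    """Share of n in total, as a percentage rounded to one decimal."""
    return round(100 * n / total, 1)


it_key = "Computer Science/Computer Engineering / IT"
it_share = percent(counts[it_key], total)

# Reassign the three "Other" answers to the IT category, as the text suggests.
merged_share = percent(counts[it_key] + counts["Other (please specify)"], total)

print(total, it_share, merged_share)
```

With the counts above, the IT category moves from 72 to 75 of the 125 answers once the three “Other” responses are merged into it.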

The level of education of the respondents is very high: Table 6
shows that 88% of the respondents have postgraduate
education, with 57.6% Doctorate and 30.4% Master
degree holders. Of the remaining respondents, 8% earned or are
pursuing their Bachelor degrees and 4% earned or are pursuing
their High School diploma.

The respondents overwhelmingly consider that having access
to technology is “very important” (Figure 4), which is
expected given the technology era we are living in.


Table 4: Statistics on the age of respondents

Question 3: What is your age?

Age category | Response Percent | Response Count
20 or younger | 1.6% | 2
21-30 | 14.3% | 18
31-40 | 35.7% | 45
41-50 | 28.6% | 36
51-60 | 14.3% | 18
61-70 | 4.8% | 6
70 or older | 0.8% | 1

Table 5: Area of expertise/study of survey participants.

Question 4: What is your area of expertise/study?

Answer Options | Response Percent | Response Count
Business and Economics | 3.2% | 4
Computer Science/Computer Engineering / IT | 57.6% | 72
Education/Teaching | 14.4% | 18
Engineering | 3.2% | 4
Government, Law, Politics and Policy | 0.8% | 1
Language and Linguistics | 4.0% | 5
Medicine, Nursing and Health Sciences | 5.6% | 7
Philosophy, Ethics and Theology | 0.8% | 1
Sciences (Biology, Chemistry, Mathematics, Physics.) | 7.2% | 9
Visual and Performing Arts and Film Studies | 0.8% | 1
Other (please specify) | 2.4% | 3


Table 6: Level of Education

Question 5: What is the highest level of education you have completed?

Answer Options | Response Percent | Response Count
In High School | 3.2% | 4
Graduated from high school | 0.8% | 1
2 year Diploma | 0.0% | 0
1st year of university | 0.8% | 1
2nd year of university | 0.8% | 1
3rd year of university | 0.0% | 0
4th year of university | 0.0% | 0
Earned a Bachelor's degree (4 or 5 years) | 6.4% | 8
Earned a Master degree | 30.4% | 38
Earned a Doctorate degree | 57.6% | 72


[Figure 4 (bar chart). Question: How important to your learning is it to
have access to technology? Bars: Very important; Pretty important, 4.0%;
Important, 0.8%; Not very important, 0.0%; Not important at all, 0.0%.
Horizontal axis: 0.0% to 100.0%.]

Figure 4: How important to have access to technology?

B. Application software technologies 

This study investigated the capability of individuals with
application software technologies (Question 6) that most of us
may use on a daily basis, such as text processing, adding
links to media files, creating and editing video and audio files,
and publishing websites.

The results show that all participants are capable of creating
basic text documents with a word processor, at levels ranging
from advanced to acceptable. Almost 98% of the participants are
capable of creating text documents using advanced features and
of adding multimedia links into documents. Overall, more than
70% of the participants are capable of using all the application
software technologies listed in Table 7. Fewer participants have
been exposed to some technologies, such as creating and
publishing a blog or online journal, programming a computer
application, and designing and publishing websites.


Table 7: Statistics on how capable are individuals with application software technologies

Question 7: How capable are you with each of the following technologies to create or share documents?

Answer Options | Very Capable (Advanced) | Capable (Good) | Somewhat Capable (Acceptable) | Not very Capable (Poor) | Never Used | Response Count
Creating text documents with a word processor | 79.70% | 17.10% | 3.30% | 0.00% | 0.00% | 123
Creating text documents using advanced features such as tables, images, formatting, macros... etc. | 70.20% | 25.00% | 3.20% | 1.60% | 0.00% | 124
Adding links to videos, audio, images into documents. | 63.70% | 22.60% | 11.30% | 0.00% | 2.40% | 124
Finding help from other people on the Internet using forms, message boards, social media ... etc. | 33.10% | 37.90% | 19.40% | 4.80% | 4.80% | 124
Programming a computer application or program for others to use | 32.00% | 25.40% | 18.00% | 12.30% | 12.30% | 122
Sharing or embedding video/audio files on websites. | 29.50% | 36.10% | 20.50% | 4.90% | 9.00% | 122
Creating and publishing a blog or online journal | 23.60% | 24.40% | 22.00% | 11.40% | 18.70% | 123
Creating and editing videos | 21.10% | 29.30% | 35.80% | 7.30% | 6.50% | 123
Creating and editing audio/sound files and recordings | 21.10% | 30.90% | 35.00% | 5.70% | 7.30% | 123
Designing websites using HTML, CSS, JavaScript | 17.90% | 26.00% | 26.00% | 18.70% | 11.40% | 123

Table 8 provides the respondents’ usage of application
software. Interestingly, social media and word processing
software are used daily by approximately 75% of the
participants. Writing programs and using imaging software are
done daily by approximately 15% of the participants. These
types of software are specialized, and given the high level of
education of these participants, this is expected. It is also shown
that approximately 50% of the participants use these software
at least once a month. On the other hand, playing computer
games, using web development software, and using video
processing software are rare for most of the respondents, with
more than 60% of the respondents answering “rarely” or
“never used”.


Table 8: Statistics on how often individuals use application software

Question 8: How often do you use the following application software?

Answer Options | Daily | Weekly/Biweekly | Monthly | Rarely | Never | Response Count
Using Social Media websites (Facebook, Twitter, ... etc.) | 75.80% | 14.50% | 3.20% | 6.50% | 0.00% | 124
Word processing software (MS Word, Apple iWork, ... etc.) | 74.80% | 16.30% | 4.10% | 4.10% | 0.80% | 123
Writing programs using programming language (C++, Java, ... etc.) | 14.60% | 14.60% | 17.90% | 30.10% | 22.80% | 123
Editing imaging software (Paint, Photoshop, ... etc.) | 14.40% | 29.60% | 22.40% | 30.40% | 3.20% | 125
Playing computer games | 8.30% | 6.70% | 15.80% | 40.00% | 29.20% | 120
Web development software (HTML, CSS, JavaScript, ... etc.) | 4.90% | 13.00% | 21.10% | 40.70% | 20.30% | 123
Video processing/editing software (Movie Maker, iMovie, ... etc.) | 4.00% | 9.70% | 22.60% | 50.80% | 12.90% | 124


The responses on the problems encountered with application
software, Table 9, show that every problem listed was
encountered by at least 17% of the respondents, with “the
software is not free and I cannot afford buying it” receiving the
highest percentage, 61.3%. Note that in Question 9 the
respondents could choose more than one answer, and those who
had not encountered any problems could skip the question;
however, some of the respondents chose the answer option
“Other” and wrote “nothing”, “never used”, or “no problems
encountered”. The “Other” responses did not state any real
problems except for one, “Not all software are fully localized
for Middle East”. This is a real problem that many may face,
especially if they are not familiar with English or with the
technical terminology needed during the installation process
of any software.


Table 9: Problems encountered with application software

Question 9: Which of the following problems you have encountered
with the application software in Question 8? (You may choose
more than one answer).

Answer Options | Response Percent | Response Count
Unable to download the software. | 18.9% | 20
Unable to install the software and decided to remove and not use. | 17.0% | 18
Had problems during installation and had to receive help online and/or from a friend. | 17.0% | 18
The software is not easy to use, the interface is not user-friendly. | 24.5% | 26
I always have to consult manuals, online help or YouTube videos (tutorials) for help. | 32.1% | 34
The software is not free and I cannot afford buying it. | 61.3% | 65
Other (please specify) | 13.2% | 14


C. OCR Software 

In Question 10, respondents were asked whether they had used a
scanner or taken a photo/image of a text document, which may or
may not include text and images. The results show that 94.9%
of the respondents have used a scanner. This was followed by
Question 11, “Have you used OCR software to convert image
files to editable text files?” Interestingly, over 62% of the
respondents had used OCR software, either through their scanner
OCR software or special OCR software. Question 12 then asked
about the languages used for OCR and listed the top 10 spoken
languages in the world [42]: Chinese, Spanish, English, Hindi,
Arabic, Portuguese, Bengali, Russian, Japanese and Punjabi. An
“Other” option was also provided for the participants to include
their language if it was not on the list. The “Other” option
received 16 responses. The results show that the most used
languages with OCR were English, followed by Arabic and then
French, Table 10. Other languages with one or two responses
include Spanish, Portuguese, Russian, Punjabi, Malay, and
Swedish. French, Swedish and Malay were entered under the
“Other” answer option, where French received five responses and
Swedish and Malay one response each. The remaining 9 responses
to the “Other” option were “never”, “not used”, “nothing”, etc.
Question 13 examined the usage of the top four OCR software in
2017, and the results are given in Table 11.

Table 10: Responses for Question 12, “For which of the following languages
have you used OCR”

Question 12: For which of the following languages have you
used OCR software? (You may choose more than one answer).

Answer Options | Response Percent | Response Count
Chinese | 0.0% | 0
Spanish | 1.0% | 1
English | 83.3% | 85
Hindi | 0.0% | 0
Arabic | 44.1% | 45
Portuguese | 2.0% | 2
Bengali | 0.0% | 0
Russian | 2.0% | 2
Japanese | 0.0% | 0
Punjabi | 1.0% | 1
French | 4.9% | 5
Malay | 1.0% | 1
Swedish | 1.0% | 1


Table 11 shows that the best OCR software, Adobe Acrobat,
has been used by approximately 84% of respondents. In contrast,
the other OCR software have rarely or never been used by most
of the respondents: at least 73% of respondents never used each
of them. It is also observed that most of those who have used
these software used them rarely, and the only software with a
reasonable daily usage rate is Adobe Acrobat (30.7%), followed
by ABBYY FineReader (1.9%). Similarly, in terms of competence
in using the software, roughly the same rates are reflected in
the capability levels of individuals, Table 12. Overall, those
who used the software daily show more competence in using it.
The survey also investigated the problems which the users may
have encountered with the OCR software, Table 13. The results
show that 41.1% of the respondents consider it a problem that
software they may need is not free and that they cannot afford
to buy it. In addition, 27.8% of the respondents faced
difficulties and tend to consult technical support, manuals,
tutorials, etc., in order to resolve their issues. The “Other”
answer option did not receive any responses worth mentioning;
most of the responses were “none”, “never used”, “Arabic OCR
needs some work to reach the accuracy of English”, etc.

In addition, Question 16 presented the nine most commonly
available features in OCR software (given in Table 14) and asked
the respondents whether they knew that such features are
available in OCR software. Overall, the results showed that, on
average, 56% of the users are familiar with the listed features.


Table 11: Statistics on the usage of OCR software

Question 13: How often have you used the following OCR software? (Note: Never means you haven't heard
about or used this software before.)

Answer Options | Daily | Weekly/Biweekly | Monthly | Rarely | Never | Response Count
Adobe Acrobat | 30.7% | 14.0% | 8.8% | 29.8% | 16.7% | 114
OmniPage Standard | 0.9% | 0.9% | 4.7% | 19.6% | 73.8% | 107
ABBYY FineReader | 1.9% | 2.8% | 0.9% | 14.2% | 80.2% | 106
Readiris | 1.0% | 1.9% | 2.9% | 17.1% | 77.1% | 105


Table 12: Statistics on the level of capability of individuals with OCR software

Question 14: How capable are you with each of the following OCR software you have used?

Answer Options | Very Capable (Advanced) | Capable (Good) | Somewhat Capable (Acceptable) | Not very Capable (Poor) | Never Used | Response Count
Adobe Acrobat | 33.3% | 30.6% | 15.7% | 6.5% | 13.9% | 108
OmniPage Standard | 2.9% | 8.6% | 8.6% | 8.6% | 71.4% | 105
ABBYY FineReader | 2.9% | 5.8% | 2.9% | 7.7% | 80.8% | 104
Readiris | 2.0% | 8.8% | 5.9% | 6.9% | 76.5% | 102


Table 13: Problems encountered with OCR software

Question 15: Which of the following problems have you encountered with the OCR software in Question 14?
(You may choose more than one answer).

Answer Options | Response Percent | Response Count
Unable to download the software. | 13.3% | 12
Unable to install the software and decided to uninstall and not use. | 10.0% | 9
Had problems during installation and had to receive help online and/or from a friend. | 13.3% | 12
The software is not easy to use, the interface is not user-friendly. | 7.8% | 7
I always have to consult manuals, online help or YouTube videos (tutorials) for help. | 27.8% | 25
The software is not free and I cannot afford buying it. | 41.1% | 37
Please enter your statement | 22.2% | 20

Question 17 asks the respondents to rank the accuracy of the
OCR software for the languages they have used. Surprisingly,
for each language other than English and Arabic, over 90% of
the respondents had never used OCR. Approximately 50% of the
respondents rank OCR for English as excellent, and 14.3% rank
Arabic as excellent. In addition, 23.8% rank OCR for English
and 19.4% rank OCR for Arabic as good. More respondents used
English OCR than OCR for any other language, followed by
Arabic. The results may not be accurate and cannot be
generalized since most of the respondents were from the Middle
East, and very few people represented the other languages
mentioned in Table 15.

To further explore the use of OCR software, Question 18
asked if the respondents had used OCR software to convert
handwritten text documents into editable text documents. The
results show that about a third of the respondents (32.4%) used
OCR for handwritten documents. Question 19 then asked about
the OCR software accuracy level for handwritten documents.


Table 14: OCR Features

Question 16: Do you know that the following features are available with many OCR software?

Answer Options | Yes | No | Response Count
Multi language support | 72.2% | 27.8% | 108
The output file retains layout of original scanned paper document | 63.2% | 36.8% | 106
The output file retains Fonts of original scanned paper document. | 51.9% | 48.1% | 106
The output file retains Tables from original scanned paper document | 52.8% | 47.2% | 106
OCR software can de-skew (rotate) image | 50.9% | 49.1% | 106
Multi-page document recognition | 58.1% | 41.9% | 105
Integrates with Cloud Storage. | 41.0% | 59.0% | 105
Integrates with MS Office applications. | 66.4% | 33.6% | 107
Integrates with HTML. | 42.5% | 57.5% | 106
Average | 56% | 44% | 106


Table 16 provides the results for Question 19, presenting the
accuracy estimated from the user’s perspective for the
languages used by the respondents. The results show that, for
languages other than English and Arabic, a minimum of 93% of
respondents never used OCR for handwritten documents. The
accuracy level, as expected, is very low compared to printed
documents. Here, English received the highest “Excellent”
rating (i.e., above 85% accuracy) at 18.8%, followed by Arabic
at 10.6%.

To conclude, the respondents were asked to select the
statement/s that indicate how they feel about OCR. Table 17
shows that the statement, “OCR is a very important technology
and must be available as a feature in all text/image processing
software,” received the highest response with 78.3%, followed by
the statement, “OCR is very important to be used by universities
and public libraries to convert old documents to digital
documents,” with 41.5%.


Table 15: Ranking the accuracy of OCR software for different languages using printed documents

Question 17: How do you rank the accuracy of OCR in the following languages, if applicable?

Answer Option | Excellent (Above 85%) | Good (70% - 84.9%) | Acceptable (50% - 69.9%) | Poor (Below 50%) | Never Used | Response Count
Chinese | 0.0% | 3.6% | 2.4% | 1.2% | 92.8% | 83
Spanish | 0.0% | 3.8% | 2.5% | 0.0% | 93.8% | 80
English | 49.5% | 23.8% | 9.5% | 4.8% | 12.4% | 105
Hindi | 1.2% | 3.7% | 2.4% | 2.4% | 90.2% | 82
Arabic | 14.3% | 19.4% | 16.3% | 11.2% | 38.8% | 98
Portuguese | 0.0% | 6.0% | 1.2% | 0.0% | 92.9% | 84
Bengali | 0.0% | 1.2% | 2.4% | 2.4% | 93.9% | 82
Russian | 0.0% | 2.5% | 4.9% | 1.2% | 91.4% | 81
Japanese | 0.0% | 3.7% | 1.2% | 2.4% | 92.7% | 82
Punjabi | 0.0% | 1.2% | 3.7% | 2.5% | 92.6% | 81

Table 16: Ranking the accuracy of OCR software for different languages using handwritten documents

Question 19: If your answer to Question 21 is yes, How was the accuracy? and in which language?

Answer Option | Excellent (Above 85%) | Good (70% - 84.9%) | Acceptable (50% - 69.9%) | Poor (Below 50%) | Never Used | Response Count
Chinese | 0.0% | 3.5% | 0.0% | 0.0% | 96.5% | 57
Spanish | 0.0% | 1.8% | 0.0% | 1.8% | 96.5% | 57
English | 18.8% | 17.4% | 14.5% | 4.3% | 44.9% | 69
Hindi | 1.8% | 0.0% | 0.0% | 5.4% | 92.9% | 56
Arabic | 10.6% | 9.1% | 12.1% | 6.1% | 62.1% | 66
Portuguese | 0.0% | 1.8% | 0.0% | 0.0% | 98.2% | 57
Bengali | 1.8% | 0.0% | 0.0% | 1.8% | 96.5% | 57
Russian | 0.0% | 1.8% | 0.0% | 1.8% | 96.5% | 57
Japanese | 0.0% | 1.8% | 0.0% | 1.8% | 96.5% | 57
Punjabi | 0.0% | 0.0% | 1.8% | 0.0% | 98.2% | 57


Table 17: Responses to “Select the statement that indicates how you feel about OCR”

Question 20: Please select the statement that indicates how you feel about OCR? (You may choose more than one statement).

Answer Options | Response Percent | Response Count
OCR is a very important technology and must be available as a feature in all text/image processing software. | 78.3% | 83
OCR is a technology that is not mature enough to be used for some languages which written from right to left, such as Arabic. | 21.7% | 23
OCR is a technology that is not mature enough to be used for many languages. | 16.0% | 17
OCR is acceptable to be used for printed text; however, with handwritten text it is preferred not to be used. | 19.8% | 21
OCR is very important to be used in government offices to convert handwritten documents to editable files. | 22.6% | 24
OCR is very important to be used by universities and public libraries to convert old documents to digital documents. | 41.5% | 44
OCR is a saver for old documents and manuscripts from being lost/stolen/burned ... etc. | 36.8% | 39
Other (please specify) | 8.5% | 9


Finally, the respondents were given the chance to provide 
comments and/or suggestions about the survey. From the 
responses, almost all of the comments were “thank you”, 
“nothing”, and “good luck” comments except for the following 
three interesting comments: the first comment was “Thanks for 
providing me an opportunity to be part of your research and 
share my contribution. I think that expertise of university 
teacher’s use of software and its related knowledge depends 
upon the availability of software in the institution. If an 
institution has enough sufficient software packages, then its 
next responsibility is to train its employees in using that 
software. So, institutions should have latest software and 
expertise in using and providing training in use of that 
software.” The second comment was “I am not familiar with 
most of the software, and I think it is useless to have these many 
software”. The third comment is, “I only use the software 
provided by the operating system, and those available on my 
computer at work.” These comments summarize the reasons
why people are not familiar with many software. We spend most
of our time at work, and this influences the type of software
we use, depending on our areas of expertise.

V. Limitations and Discussion 

In general, document analysis software is essential. OCR
reduces the time needed for data collection, which, if done
manually (data entry), takes longer and is prone to human
error. The study findings may benefit only a limited number of
people who are interested in English/Arabic OCR and image
editing software, which may explain why not many people
participated in this survey. In addition, the low response rate
could be because there were no monetary incentives for filling
in the survey. The results show that people were overwhelmed by
the number of available software, most of which they had never
heard about. Even though the top four OCR software were
listed, the results show that approximately 73% of the
respondents never used or even heard about these software.

There are several limitations in the study findings. The study
results cannot be generalized since the sampling was not purely
random: the survey was first sent to the authors’ contacts,
mainly faculty members in the IT sector, who were asked to
forward it in order to widen the sample to include individuals
from other sectors. Approximately 58% of the respondents are
from the IT sector. The survey was distributed online to people
who are familiar with using the Internet, sending emails, and
using social media. The data collection was geographically
distributed all over the world; however, 58% of the respondents
were from the Middle East, and about 71% of the Middle East
respondents were from Saudi Arabia.
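
The geographic figures can be cross-checked against the response counts in Table 3. The sketch below is illustrative only and not part of the original analysis; the variable names are assumptions, the counts are copied from Table 3, and the 58% Middle East share is taken from the text as stated because the table's highlighting did not survive.

```python
# Response counts copied from Table 3, grouped by continent.
counts_by_continent = {
    "Africa": [5, 12, 1, 1, 1, 1, 3, 1],
    "Asia": [1, 6, 1, 3, 10, 1, 4, 1, 1, 51, 2],
    "Europe": [1, 1, 4],
    "N. America": [3, 7],
    "S. America": [2],
}

total = sum(sum(c) for c in counts_by_continent.values())

# Continent shares, as percentages rounded to one decimal place.
continent_percent = {
    name: round(100 * sum(c) / total, 1)
    for name, c in counts_by_continent.items()
}

saudi_count = 51           # Saudi Arabia row of Table 3
middle_east_share = 0.58   # share of all respondents, as stated in the text

# Share of Saudi Arabia among all respondents, then among Middle East ones.
saudi_share_overall = saudi_count / total
saudi_share_in_middle_east = saudi_share_overall / middle_east_share

print(total, continent_percent)
print(round(100 * saudi_share_overall, 1), round(100 * saudi_share_in_middle_east))
```

Dividing Saudi Arabia's share of all respondents by the reported Middle East share gives a value consistent with the "about 71%" quoted above.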

The accuracy rates reported for commercial OCR software are
not reflected in the results obtained in this study. For example,
only about 50% of the respondents ranked the recognition of
English (Latin script) as excellent, which falls short of the
100% accuracy reported in [40]. Therefore, further
investigation of the accuracy rates is needed for all language
scripts, including Latin scripts.

In the future, this study needs to be distributed to a wider
range of people, especially young people who are skilled with
technology, whom this survey did not reach. In this study, 20%
of the respondents were below 30 years old, and only 1.6%
were 20 or younger. Although this study provides insights into
the familiarity of individuals with document and OCR
technology with respect to age and gender, future research
is needed to understand the familiarity and capability of
individuals with different software across different age
groups and genders.

An individual’s area of expertise and place of employment
largely influence the types of software used; the work
environment also influences the type of social media used.
Another important finding from the survey is that technical
support for the software in multiple languages, especially
Arabic, is not available, which may be one reason why such
software is not popular.

Currently, companies are providing free or limited usage of
online OCR services or OCR mobile apps. These provide real-time
processing which can be used anytime from a personal
computer or a mobile device. Future studies should also evaluate
these services and their potential to replace current software
installed on desktop computers.

A framework for testing OCR software is urgently needed to
be able to test multilingual document analysis and OCR
software. Databases of documents to test this software need to
be developed in a similar fashion for all languages, so that
different languages can be tested and comparable results
obtained for any language under any OCR platform. Therefore,
the author recommends more research on scripts other than
Latin, where the recognition rates are far below those
reported for Latin scripts, especially cursive scripts and
printed handwritten scripts for all languages, including Latin.

VI. Conclusion and Future Work 

OCR technology has progressed a great deal since it started in
the early years of the last century. The study has confirmed
that document analysis software, and OCR in particular, is an
important and essential technology, and its use is seen in many
applications around us. The recognition of cursive text in any
language is still an active area of research. Higher recognition
rates for printed or handwritten text are not possible without
the use of contextual or grammatical information. The work
carried out in this paper is the first of its kind, and it sheds
some light on the user perspective and users’ familiarity with
document analysis software, especially OCR. The study concludes
that a framework for testing OCR software is urgently needed to
be able to test multilingual document analysis and OCR software.
Finally, more research is encouraged for scripts other than
Latin.

Abstract- The digital authentication, certification and standardization of Information Technology (IT)
applications of the Holy Quran have attracted researchers and organizations to explore their importance and
requirements. This is an emerging area of research that needs more attention since it investigates sensitive
digital documents and applications. The objective of this study is to gather information from users, developers,
designers, decision makers and organizations on their perception of authentication, certification and
standardization of IT applications for the Holy Quran. This study is based on a questionnaire survey which was
distributed online to highly educated individuals from all over the world. The study emphasized the need for
more security and quality assurance in IT Quran applications. In addition, standardization and certification
through authentic Quran organizations are encouraged to validate the Quranic content. The study results and
analysis were extracted from over 80 questions divided between six categories of participants. The survey
received 500 responses from 31 countries and five continents. The findings of the study are considered a step
towards recognizing the importance of securing sensitive documents in all areas of research, particularly
digital Quran content.


Keywords- Quran, Authentication, Security, Quality of Service, Standards 


I. Introduction 


The Quran is the sacred book of Islam, which was revealed to Prophet Mohammed (peace be upon him) over
14 centuries ago. The Quran is usually read from a paperback format called Mushaf; however, recent
advancements in technology have allowed the use of smart digital gadgets, which are portable and adopted
by almost everyone [1]. The rapid advancements in technology, together with the worldwide spread of the
Internet, have allowed for the dissemination of online digital multimedia content. The increase in speed and
storage capabilities has allowed for a rapid increase in the digitization of all sorts of multimedia
publications. In particular, the digitization of printed documents has seen an exponential jump as archives are
digitized and offices eliminate paper and become paper-free. Quranic organizations and research
centers have therefore exploited this movement towards digitization in the development of Quranic Web and
mobile apps. In this section, the most recent research work related to Quran computing, i.e., security,
authentication and standardization, is presented.

Due to the sensitivity of the Holy Quran, it is crucial to authenticate the verses or parts of the Quran
available through webpages to avoid intentional and unintentional distortion/forgery of Quranic verses.
The goal is to use information retrieval techniques to check the World Wide Web for documents that include
unauthentic Quranic verses. This is possible, but false results may appear when the text of other Quranic
readings, which may be found on some websites, is encountered [8].

Tayan and Alginahi conducted a survey to investigate the significance of online and offline
software applications and their benefits and effectiveness for users. The study showed that approximately
80% of participants had used technology in Quran memorization, that technology is more appealing to the
younger generation, and that 50% of the participants preferred Internet/software or handheld/portable
technologies over traditional methods. The study concluded with several challenges from the
user perspective: the English-language barrier, since many applications are in English; technology
illiteracy; and limited Internet resources and connectivity. In addition, the participants provided
suggestions for new applications and modifications to existing technology [2]. In another work, Tayan and
Alginahi presented a review of different multimedia watermarking techniques which are applicable to sensitive
digital content and discussed how those approaches can accomplish the required protection [3].

The work in [4] presented a review of the types of Information Security (IS) aspects, which include data
storage, in-transit data and data access prevention for unauthorized users. It presented the main
challenges in IS with regard to vulnerabilities and data breaches, and discussed the mechanisms
for enhancing data protection as well as IS policies and standards for protecting data content.

The survey in [5] provided a review of recent research on Digital Holy Quran authentication,
protection and integrity. It focused on analyzing and categorizing the existing research related
to preserving and verifying the content integrity of the Holy Quran. It concluded with a recommendation to
develop a reliable universal database of authentic and verified Digital Quran and hadith content and, in
addition, to develop a Real-Time Quran Verse Detection Expert System with improved accuracy and
precision.

The research paper “Digital Quran Computing: Review, Classification, and Trend Analysis” by Zakariah
et al. provided a subject/theme categorization of digital Quran research based on topical trends, followed
by a discussion of their key features, limitations, and research directions. Next, a set of recommendations
regarding security and authentication, standardization and quality of service, unified translation, and
E-Learning approaches and the Quran knowledge base was presented. Finally, the authors concluded by
discussing the open challenges in digital Quran computing [6].

In [7], Sabbah and Selamat proposed a framework for detecting the authenticity of Quranic verses in text
extracted from online forum posts. The method computes numerical identifiers for the words in the detected text
and then compares them with the identifiers of the original Quranic manuscript. The results show that the
average accuracy rate was 62%, precision 75% and recall 78%. The authors stated that their future work
will incorporate computational intelligence methods to improve authentication and will also cover
other Quran media such as sound, images and video to improve the detection.

Alsmadi and Zarour designed a tool to evaluate the integrity of the wording in the e-versions of the
Quran by generating metadata for all words in the Quran, i.e., preserving their counts and
locations. This is similar to the hash algorithms used to check the integrity of data files on a disk: a tiny
modification results in a different hash value. The results show that hashing verification can be a
good candidate for the automatic Quran authentication process with high confidence [8].
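
The hashing idea described above can be illustrated with a minimal Python sketch. It is not the tool from [8]: SHA-256, the manifest layout, the function names and the placeholder strings are all assumptions made here for illustration. Each trusted verse is stored as a digest plus a word count, and a candidate text passes only if its digest matches.

```python
import hashlib
import unicodedata


def canonical(text):
    """Normalize Unicode (NFC) and collapse whitespace so that
    formatting differences do not change the digest."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def verse_digest(text):
    """SHA-256 digest of the canonical verse text (assumed hash choice)."""
    return hashlib.sha256(canonical(text).encode("utf-8")).hexdigest()


def build_manifest(verses):
    """Map each verse key to the digest and word count of its trusted text."""
    return {
        key: {"digest": verse_digest(text), "count": len(canonical(text).split())}
        for key, text in verses.items()
    }


def verify(manifest, key, candidate):
    """True only if the candidate text matches the trusted digest for `key`."""
    entry = manifest.get(key)
    return entry is not None and entry["digest"] == verse_digest(candidate)


# Placeholder strings stand in for a verified reference text (hypothetical data).
trusted = {"1:1": "alpha beta gamma delta", "1:2": "epsilon zeta eta"}
manifest = build_manifest(trusted)

print(verify(manifest, "1:1", "alpha  beta gamma delta"))  # extra space is normalized away
print(verify(manifest, "1:1", "alpha beta gamna delta"))   # a one-letter change is detected
```

Storing only digests and counts means the verifier does not need to ship the reference text itself, although a mismatch then reveals that a verse differs without showing where.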

With the exponential increase in the number of websites on the Internet, sensitive
information, such as the digital Quran, may be at risk. Among the huge number of Islamic websites, not many
present digital copies of the Quran that are verified and certified by an authentic source. The work by
Mostafa and Ibrahim proposed a computer-based system built on public key infrastructure and digital
signatures to secure and verify the content of the Holy Quran script on the web [9]. The authors claim that
the system is applicable to Holy Quran applications.

Digital copies of fake or distorted Quran have been detected online and in some Quran mobile apps.
Therefore, eliminating such forgery, whether intentional or unintentional, through authentication will
address most of the threats and give users confidence in using such web and mobile applications. The study
in [10] presents a comprehensive research survey of works conducted in the area of Quran authentication
from an information security perspective. This study concluded that there is an urgent need to provide good
content security and integrity for the digital Quran [10].

Cryptography has been used to secure data against tampering attempts and to protect data integrity.
The work by AlAhmad, Alshaikhli, and Jumaah proposed a cryptographic algorithm, the Combination between
AES and RSA Cryptography Algorithms (CARCA). The results show that the CARCA method, with the
two encryption algorithms, improved the protection of the Digital Holy Quran
Hash Digest [11].

The works in [12 - 13] presented a watermarking method to enable authentication and the detection of
image forgery on digital Quran images. The proposed method uses a two-layer embedding
scheme. First, a discrete wavelet transform is applied to decompose the host image prior
to embedding the watermark in the wavelet domain. Then, the watermarked wavelet coefficients are inverted
back to the spatial domain, where the least significant bits are utilized to hide another watermark. Next, a
chaotic map is used to blur the watermark to secure it against local attacks. This technique provides high
watermark payloads while preserving the image quality.
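
To make the spatial-domain layer concrete, the following Python sketch shows plain least-significant-bit (LSB) embedding and extraction on a list of 8-bit grayscale values. It covers only the LSB step; the wavelet-domain layer and the chaotic map of [12 - 13] are not reproduced, and the pixel values, watermark bits and function names are hypothetical.

```python
def embed_lsb(pixels, bits):
    """Write one watermark bit into the least significant bit of each pixel."""
    if len(bits) > len(pixels):
        raise ValueError("watermark is longer than the host image")
    marked = list(pixels)
    for i, bit in enumerate(bits):
        # Clear the lowest bit, then set it to the watermark bit.
        marked[i] = (marked[i] & ~1) | bit
    return marked


def extract_lsb(pixels, length):
    """Read back the first `length` least significant bits."""
    return [p & 1 for p in pixels[:length]]


host = [120, 121, 200, 37, 255, 0, 64, 90]  # hypothetical 8-bit grayscale samples
watermark = [1, 0, 1, 1, 0, 0, 1, 0]        # hypothetical watermark bits

marked = embed_lsb(host, watermark)
recovered = extract_lsb(marked, len(watermark))

# Flipping a single low-order bit simulates a local modification of the image.
tampered = list(marked)
tampered[2] ^= 1

print(recovered == watermark)                             # watermark survives embedding
print(extract_lsb(tampered, len(watermark)) == watermark)  # tampering breaks the match
print(max(abs(a - b) for a, b in zip(host, marked)))       # largest per-pixel change
```

Because each pixel changes by at most one intensity level, the embedding is visually negligible, which is consistent with the image-quality claim above; the same fragility is what lets a mismatch signal tampering.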

On the issue of certification, Khan, Siddiqui and Tayan presented a digital Quran certification framework 
by utilizing modern digital authentication and certification techniques i.e., a certification authority and 
religious scholars follow a rigorous procedure that checks the requirements process for certification and 
upon approval a digital certificate is issued for the application. This framework controls the digital content 
of the Holy Quran and doesn’t allow any modifications by the users [14]. 

To the best knowledge of the authors, the literature provides good-quality work on security and authentication;
however, there is very little work on Quran certification and no work available on the standardization of the digital
Quran. As presented in this section, most of the available Quran authentication techniques have been used
for research purposes, and no commercial implementation of such techniques is provided. On the
issue of certifying digital copies of the Quran, what is available is a framework for certification. On the
other hand, the issue of standardization has not been addressed in the literature. Therefore, the available research
work is considered a positive step for researchers in this area to come together and set procedures and
standards for security and authentication, certification and standardization of the digital Quran.

In this paper, the security and authentication, certification, and standardization challenges regarding the
development of Quran apps are presented from the perspective of users, developers, designers, decision
makers and institutions. The study was based on a survey conducted to gather information on technology
adoption for reading the Quran, security and quality control issues in the development of Quran software
(mobile and web applications), and standardization issues of institutions and software. This work
is unique in that it gathered feedback from six different categories to identify all aspects of
authentication, certification and standardization from the perspective of users, developers and
programmers, designers, managers, organizations and publishers.


II. Methodology 

A semi-structured questionnaire survey was developed using monkeysurvey.ca and distributed online to
the authors' contacts, who are mainly highly educated individuals from all over the world. The survey was
delivered and distributed using a range of social media tools, and its
findings are taken to characterize a sample of English/Arabic-speaking users of Quran products. The main
objective of this study is to gather the perspective/feedback of different categories of individuals and
organizations on the requirements and importance of digital authentication, certification and
standardization of IT applications for the Holy Quran. Before the survey was conducted, it was piloted with
ten subjects, who were interviewed to provide their feedback on the survey questions and
comment on their clarity. Next, the semi-structured survey was designed and then distributed online to over 3500
emails and social media accounts, including WhatsApp. The response rate of the survey was approximately
14%. The survey was distributed in both English and Arabic. The total number of respondents was 500,
of whom 307 filled in the English survey and 193 filled in the Arabic survey. The question types include
Yes/No, multiple-choice and open-ended questions.


TABLE 1 


THE PERSONAL PROFILE QUESTIONS ASKED 


| No. | Question | Type of Question | Completed Responses (Arabic) | Completed Responses (English) | Skipped Responses (Arabic) | Skipped Responses (English) |
|---|---|---|---|---|---|---|
| 1 | What is your gender? | Multiple Choice | 184 | 300 | 9 | 7 |
| 2 | Which age group do you belong to? | Multiple Choice | 184 | 302 | 9 | 5 |
| 3 | What is your highest qualification? | Multiple Choice | 184 | 301 | 9 | 6 |
| 4 | Your employment status? | Multiple Choice | 181 | 302 | 12 | 5 |
| 5 | What is your Nationality? | Dropdown | 173 | 292 | 20 | 15 |
| 6 | Which category/occupation do you belong to? | Multiple Choice | 185 | 289 | 8 | 18 |


The survey is divided into two parts: the first part includes personal questions and the second part
includes specific questions targeting certain categories of individuals and organizations by requesting them
to provide their feedback on questions related to the requirements and importance of digital authentication,
certification, and standardization of IT applications for the Holy Quran. The survey began with questions on the
personal profile of the respondents, including age, gender, nationality, employment status and education
level. Table 2 shows that the highest response rate for this survey was from the age group 35 - 44 years old
with approximately 41%, followed by 28% for the age group 45 - 54, then 19.8% for the age group 25 - 34
and 8.2% for the age group 55 - 64. The age groups below 25 and above 65 accounted for
about 3% of the respondents.


TABLE 2 

AGE GROUPS OF PARTICIPANTS 


| Answer Options | Response Percent | Response Count |
|---|---|---|
| 15-24 | 2.1% | 10 |
| 25-34 | 19.8% | 96 |
| 35-44 | 40.9% | 199 |
| 45-54 | 28.0% | 136 |
| 55-64 | 8.2% | 40 |
| 65+ | 1.0% | 5 |


From both surveys (Arabic/English), 44.6% of the participants are female
and 55.4% are male. Table 3 shows that the minimum completed education level was a high school
diploma and that approximately 89.3% of the participants hold a graduate degree (Master's or higher), with
approximately 51% holding a Doctorate degree, which shows that most of the participants are highly educated.


TABLE 3 

THE EDUCATION LEVEL OF PARTICIPANTS 


| Answer Options | Response Percent | Response Count |
|---|---|---|
| High School | 2.9% | 14 |
| Associate/Vocational/Trade Degree | 0.2% | 1 |
| Certified Professional/Diploma Holder | 0.2% | 1 |
| Bachelor Degree Holder | 5.2% | 25 |
| Master Degree Holder | 26.2% | 127 |
| Doctorate Degree Holder | 50.9% | 247 |
| Postdoctoral | 12.2% | 59 |
| Prefer not to say | 2.3% | 11 |


The employment status of the participants shows that 91% are employed full-time, part-time or
self-employed, 6% are students, 0.4% are retired, 1.4% are unemployed and 1.2% preferred not to say. The
nationalities of the participants included 31 countries from five continents. The following are the countries
of the participants: United States (3), Algeria (38), Australia (4), Bahrain (3), Bangladesh (2), Canada (7),
China (1), Egypt (34), Ethiopia (1), India (33), Indonesia (4), Iran (3), Iraq (7), Jordan (65), Kuwait (6),
Lebanon (3), Libya (4), Malaysia (78), Morocco (9), Nigeria (3), Oman (7), Pakistan (54), Palestine (10),
Saudi Arabia (23), Sudan (18), Spain (1), Syria (10), Tunisia (4), Turkey (3), United Kingdom (10), and
Yemen (17). The number in parentheses indicates the number of participants from each country.

After responding to the personal profile section of the survey, the participants had to choose the
category they belong to from the six categories shown in Table 4 before moving to the second part of the
survey and answering the questions related to that category. Table 4 provides the categories
and the number of respondents in each category.
TABLE 4

THE CATEGORIES OF PARTICIPANTS

| Name of Category | Response Percent | Response Count |
|---|---|---|
| 1- A general IT/Software User or Advanced User | 47.3% | 224 |
| 2- Software/Mobile/Digital Application Developer | 18.4% | 87 |
| 3- Graphics Designer | 1.9% | 9 |
| 4- Organization Manager/Director/Decision Makers/Owners/CEOs | 22.8% | 108 |
| 5- Islamic Institutes | 6.8% | 32 |
| 6- Holy Quran Publishers | 3.0% | 14 |


Even though Table 4 shows that 95% of the 500 participants completed the personal profile by choosing
a category, only 350 participants, i.e., 70% of the total number of participants, completed the questions of the
category they chose in Table 4. Finally, Table 5 provides the responses for each category. The results show
that 51.4% are general IT/software users, 18.3% are software developers, and 22.3% are managers and decision
makers, with less than 7% from Islamic institutes and Holy Quran publishers.


TABLE 5 

THE RESPONSE RATE OF PARTICIPANTS FOR EACH CATEGORY 


| Name of Category | Response Percent | Response Count |
|---|---|---|
| 1- A general IT/Software User or Advanced User (15 Questions) | 51.4% | 180 |
| 2- Software/Mobile/Digital Application Developer (16 Questions) | 18.3% | 64 |
| 3- Graphics Designer (8 Questions) | 1.1% | 4 |
| 4- Organization Manager/Director/Decision Makers/Owners/CEOs (15 Questions) | 22.3% | 78 |
| 5- Islamic Institutes (9 Questions) | 5.7% | 20 |
| 6- Holy Quran Publishers (11 Questions) | 1.1% | 4 |
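
The gap between choosing a category (Table 4) and completing its questions (Table 5) can be quantified per category. The short Python sketch below only recomputes ratios from the counts reported in the two tables; the dictionary keys are shortened labels introduced here for illustration.

```python
# Counts of participants who chose each category (Table 4).
chose = {
    "General user": 224,
    "Developer": 87,
    "Graphics designer": 9,
    "Manager/decision maker": 108,
    "Islamic institute": 32,
    "Quran publisher": 14,
}

# Counts of participants who completed the category questions (Table 5).
completed = {
    "General user": 180,
    "Developer": 64,
    "Graphics designer": 4,
    "Manager/decision maker": 78,
    "Islamic institute": 20,
    "Quran publisher": 4,
}

# Completion rate per category, in percent.
rates = {name: 100 * completed[name] / chose[name] for name in chose}

for name, rate in rates.items():
    print(f"{name}: {completed[name]}/{chose[name]} = {rate:.1f}%")

total_chose = sum(chose.values())
total_completed = sum(completed.values())
print(f"Chose a category: {total_chose} of 500 ({100 * total_chose / 500:.1f}%)")
print(f"Completed questions: {total_completed} of 500 ({100 * total_completed / 500:.1f}%)")
```

Read this way, the smaller categories (graphics designers and publishers) lose proportionally more respondents between the two steps than the general user category does, which is worth keeping in mind when interpreting their results.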


III. Results Analysis & Discussion 

This section presents the analysis and discussion for the results from the six survey categories. 

A. General IT/Software User or Advanced User

The list of questions for this category is provided in Table 6. 


TABLE 6 

THE GENERAL IT/SOFTWARE USER OR ADVANCED USER CATEGORY QUESTIONS 


| No. | Question | Response Count |
|---|---|---|
| 1 | How often do you recite Qur’an? | 180 |
| 2 | Do you also recite Qur’an on-line? | 180 |
| 3 | What is the reason for not reciting the Qur’an on-line? (if the previous answer is NO) | 69 |
| 4 | Is your level of confidence in digital Quran authenticity equal to when compared with printed Quran books? | 180 |
| 5 | Do you use Digital Devices to read Qur’an (Smartphones/Tablets/ digital diary etc.)? | 181 |
| 6 | Why you do not recite the Qur’an on a digital device? (if the above is NO)* | 47 |
| 7 | In your opinion, are the current digital copies of the Quran available on different digital devices authentic? | 179 |
| 8 | In your case, do you prefer a digitally signed and 100% authentic copy of the Quran on-line or on digital device? | 180 |
| 9 | Did you ever encounter a fake copy of Qur’an available on-line or on a digital device?* Please provide URL. | 177 |
| 10 | What are the main facilities/services you use from online, web-based or smartphone Quran apps?* | 186 |
| 11 | Do you think it is necessary for a Quran Authentication Body to monitor and endorse the digital copies of the Qur’an Worldwide? | 181 |
| 12 | Are you pleased with the features of most online/smartphone Quran apps? | 176 |
| 13 | Please state the name of your favorite Quran Application smartphone or URL: | 72 |
| 14 | Are you pleased with the quality of information provided by most apps? | 175 |
| 15 | What is your vision/comments/recommendations of how new technologies could be used to develop and improve Quranic/Related Smartphone Apps, Web Apps or user interaction further? | 79 |


* Question has an “Other” option or “comment box” 
The results show the users' responses to questions related to their personal use of and confidence in the digital
Quran, in addition to their perception of its authenticity. Table 7 provides the responses of the participants to
the question “How often do you recite Quran?” It shows that over 65% have memorized the Quran or recite it daily or on
alternate days and 14.4% recite it weekly, which means that about 80% of respondents read the Quran at
least once a week. Also, 62.2% recite the Quran online besides reading from a paper copy of the Quran, and
37.8% prefer reading only from paper copies of the Quran. Of those who do not read the Quran online,
92.8% feel uncomfortable reading from a computer or smart mobile screen and 7.2% specified that they are
not Internet users. However, the response to “Do you use digital devices (Smartphones/Tablets/Digital
diary ...etc.) to read Quran?” shows that 82.3% use digital devices compared to 17.7% who do not use any
gadgets to read the Quran. The reasons for not reading from digital devices include medical reasons, feeling
uncomfortable, uncertainty about the authenticity of the Quran, preferring the Musha’af (paper
copy of the Quran), and getting more benefit from reading from the Musha’af. These results emphasize the
importance of reading the Quran regularly and the need to provide ways to make it easier to recite and
memorize it.


TABLE 7 

RESPONSE OF THE PARTICIPANTS TO THE QUESTION “HOW OFTEN DO YOU RECITE QURAN?” 


| Answer Options | Response Percent | Response Count |
|---|---|---|
| I have memorized the Qur’an by heart | 2.2% | 4 |
| Daily | 46.7% | 84 |
| Alternate days | 16.7% | 30 |
| Once/Week | 14.4% | 26 |
| Once/Month | 4.4% | 8 |
| Once/Year | 1.7% | 3 |
| Rarely | 5.6% | 10 |
| Prefer not to say | 8.3% | 15 |
| Doesn’t apply to me | 0.0% | 0 |
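
The cumulative shares quoted in the text for Table 7 can be recomputed from the response counts. The Python sketch below is a simple arithmetic check; grouping respondents who have memorized the Quran with the frequent reciters is an assumption made here, chosen because it reproduces the reported figures.

```python
# Response counts from Table 7, in the order listed.
counts = {
    "Memorized": 4,
    "Daily": 84,
    "Alternate days": 30,
    "Once/Week": 26,
    "Once/Month": 8,
    "Once/Year": 3,
    "Rarely": 10,
    "Prefer not to say": 15,
    "Doesn't apply to me": 0,
}

total = sum(counts.values())

# Assumption: memorizers are counted together with daily/alternate-day reciters.
frequent = counts["Memorized"] + counts["Daily"] + counts["Alternate days"]
weekly_or_more = frequent + counts["Once/Week"]

print(f"Total responses: {total}")
print(f"Memorized, daily or alternate days: {100 * frequent / total:.1f}%")
print(f"At least once a week: {100 * weekly_or_more / total:.1f}%")
```

Without the memorizers, daily and alternate-day reciters alone account for 114 of the 180 responses, so the grouping matters for the "over 65%" figure.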


The level of confidence in digital Quran authenticity compared with printed copies of the Quran could be
another reason why people do not use or trust digital copies of the Quran. The results show that 41.6% of
the respondents never had any peculiar feelings at all regarding the level of confidence in digital Quran
authenticity, 20.6% always feel that the contents may be produced by an inauthentic source, 17.8% always
have a feeling that the contents may be forged/modified or not properly scrutinized and 21.1% never
thought about it.

Responding to the question “In your opinion, are the current digital copies of the Quran available on
different digital devices authentic?” 21.2% of the respondents believe that the digital copies of the Quran
are authentic and 7.3% think otherwise. However, 60.3% are not sure and 11.2% never thought about it.
This shows that many people assume what they download from app stores is authentic and such an issue
does not cross their minds. In addition, most people have not memorized the Quran and may not even
notice if there is a mistake. The responses to the next question, “do you prefer a digitally signed and 100%
authentic copy of the Quran on-line or on digital device?”, which addresses authenticity, show that 85%
think it is essential and will give authenticity a very high preference, 12.8% agree and may prefer
100% authenticity and 2.2% do not prefer a digitally signed copy of the Quran. In response to the question
“Did you ever encounter a fake copy of Qur’an available on-line or on a digital device?” 12.4% mentioned
that they had encountered a fake copy of the Quran. The True Furqan is one main example provided by
participants; some asked to refer to Google Images and others could not recall the names of the websites on which
they encountered a fake or modified copy of the Quran. This raises the question “Is it necessary for a Quran
Authentication Body to monitor and endorse the digital copies of the Qur’an Worldwide?” and, as expected,
the participants strongly support this, with 98.9% giving it a high preference.

The use of Internet or digital devices to recite the Quran also helps people use other facilities or services 
that are available within online Quranic/Islamic websites and/or digital apps. Table 8 shows that besides 
reciting the Quran, reading the Tafseer (Quran explanation) and using search tools are the most used 
services. Besides the services provided in Table 8, other services include learning Quranic Arabic, reciting 
Zikr (supplications), listening to YouTube, finding prayer times and Qibla direction, preparing Islamic 
lectures and sharing Quranic services on social media. The response count is more than the number of 
participants since this question allows for multiple answers. 


TABLE 8 

THE MAIN FACILITIES/SERVICES THE PARTICIPANTS USE FROM WEB AND MOBILE APPLICATIONS 


| Answer Options | Response Percent | Response Count |
|---|---|---|
| Quran text & recitation only | 27.3% | 143 |
| Tafseer | 23.1% | 121 |
| Use of those apps to help memorize Quran | 7.6% | 40 |
| Le’rab (Grammar) of the Holy Quran. | 5.0% | 26 |
| Search Facility | 19.5% | 102 |
| Phonetic Search | 5.7% | 30 |
| Statistical Analysis | 4.8% | 25 |
| Retrieving, Printing contents or resources. | 5.9% | 31 |
| Other (Please specify) | 1.1% | 6 |


The survey also shows that 70.5% of participants are pleased with the features of most
online/smartphone Quran apps and 73% are pleased with the quality of information provided by most apps.
There are many Quran apps and websites that the participants favor; the
most frequently mentioned are: Ayat (provided by King Saud University, http://quran.ksu.edu.sa/), iQuran Lite (an
iOS app), Quran Android, Quran Explorer (http://www.quranexplorer.com/quran/), and Tanzil
(http://tanzil.net/). To conclude this category, the participants were asked to provide their
comments/suggestions/vision on how new technologies could be used to develop and improve Quran-related
smartphone and web applications, and they provided many comments, suggestions and recommendations.
Some of the design features the participants would like to see in Quran apps are: zooming, contextual
search, adjustable font size, high sound quality, multi-language support, Tajweed learning, contacting a scholar, daily
random lessons and stories, pop-up word-by-word meaning while reciting the Quran, a daily alarm/reminder,
quizzes, game-based features to maintain motivation, statistical analysis, and advanced research and navigation
features. Other qualities the participants would like to see in Quran apps include interactivity
and user-friendliness, offline accessibility, being free of charge and ad-free, robustness, encapsulating
different features in one app such as qibla direction, prayer times, etc., user-friendly voice-driven apps, an
app that could match the reciter's voice with a database and then rate the quality of the recitation, and
features for people with special needs. Regarding security and authentication, the participants emphasized
that authentication should be ensured, as it increases the confidence of the user in secured apps. In
addition, the participants provided the following comments and suggestions:

• Develop one global source of authentic digital Quran provider. 

• Use cryptographic algorithms to ensure data integrity. 

• Develop a security tool that is compatible with most Quran apps to be used to check originality of 
the Quran. 

• Develop a secure cloud-based repository for Quran-related resources and make it available as a
web service for all mobile and web apps. Any Quran-related resources must first be authenticated
before being used in designing apps.

• One or more Islamic authentication companies must check all available digital apps and issue
them tags for authentication or rejection, which will make the users aware of authentic apps.

• Provide in-app capability for real-time authentication.

• Design a plug-in that checks Quran verses online.

• Develop an agent program to keep monitoring the Quran text for correctness. Such text may be 
any uploads related to Quran with the help of popular social media and other websites that may 
contain such contents. 

B. Software/SmartPhone/Digital Application Developer 

The list of questions for this category is provided in Table 9. The results provide the participants' responses to
questions related to their personal experience as web and mobile application developers.


TABLE 9 

THE “SOFTWARE/SMARTPHONE/DIGITAL APPLICATION DEVELOPER” CATEGORY QUESTIONS 


| No. | Question | Response Count |
|---|---|---|
| 1 | Are you involved in Digital Application Development for Smart devices or Web-Apps? | 64 |
| 2 | Did you ever develop any online Quran application that involved reading or reciting Quran using a Smart device? | 64 |
| 3 | Do you think the digital Quran which you have used in your application is 100% authentic and is free of any intentional/unintentional errors? | 63 |
| 4 | Do you consider “Content or Quranic Script Integrity/Authenticity” as a Primary Measure while developing a Quran application? | 63 |
| 5 | Do you apply Modern Image Processing or Signal processing techniques to prevent any tampering attempts? | 62 |
| 6 | What type of Signal-Processing or Image-Processing tools do you use?* | 53 |
| 7 | Do you follow any particular information security standards while developing applications?* | 60 |
| 8 | Which Security Standards do you follow? (If the previous answer is YES)* | 26 |
| 9 | Do you consider Secure Coding while developing such applications? | 60 |
| 10 | Did you ever witness any case in which the integrity/security of a Digital Quran application was compromised due to unsecure coding or development?* | 61 |
| 11 | Do you follow Quality Assurance standards while developing such applications? | 59 |
| 12 | Which Quality Standards do you follow? (If answer for the above question is YES)* | 27 |
| 13 | Will you consider to apply Security Standards, Techniques and Measures to prevent any type of forgeries while developing Quran Applications? | 60 |
| 14 | Do you think it is necessary for a Quran Authentication Body to monitor and endorse the digital copies of the Qur’an Worldwide? | 61 |
| 15 | Will you prefer validating your Quranic resources through a Quran Authentication Body? | 61 |
| 16 | What is your vision/comments/recommendations of how new technologies could be used to develop and improve Quranic/Related Smartphone Apps, Web Apps or user interaction further? Please specify: | 26 |


* Question has an “Other” option or “comment box” 
The results of the survey show that 73.4% of the participants are involved in digital application
development for smart devices or web apps and 28.1% have developed an online Quran application that involved
reading or reciting the Quran using a smart device. Approximately one third of the participants think the
digital Quran which they have used in their applications is 100% authentic and is free of any
intentional/unintentional errors, and 7.9% use no verification procedures during app development. On the
other hand, 42.9% are “Not sure” and 17.5% “Never thought about it”. However, 90.5% of the participants
consider Content or Quranic Script Integrity/Authenticity as a Primary Measure while developing a Quran
application.

Modern image processing is one of the main technologies that can be used in designing applications, yet
only 22.6% of the participants have used these tools to prevent tampering attempts. The share of positive
responses to this question is low, given that 51.6% responded with “No” and 25.5% “Never thought
about this issue”. This question was answered by developers who may or may not have designed Quran
applications, and therefore the responses may not be reflective of the actual developers who have designed
Quran apps before. The image processing tools used by the participants include the MATLAB Image
Processing Toolbox (30.2%), open source tools (41.5%), proprietary tools such as Adobe (9.4%) and
in-house dedicated tools (18.9%).

In response to the question “Do you follow any particular information security standards while
developing applications?” 28.3% of the participants follow IS standards, 56.7% do not follow any standards,
and 15% “Never thought about using any standards”. The list of standards used is provided in Table 10.
The responses to the “Other” option include mathematical criteria, HCI standards, NIST, RFC, and dedicated
in-house standards. Following this, the survey asked if the developers use secure coding while developing
such applications: 46.7% responded with “Yes”, 23.3% responded with “No” and 30% responded with “I
never thought about secure coding”.


TABLE 10 

THE LIST OF SECURITY STANDARDS USED BY PARTICIPANTS 


| Answer Options | Response Percent | Response Count |
|---|---|---|
| ISO 9001 | 50.0% | 13 |
| ISO 27001 | 7.7% | 2 |
| PCI-DSS | 0.0% | 0 |
| FFIEC | 0.0% | 0 |
| Other (please specify) | 42.3% | 11 |


The security of the digital Quran should not be compromised when designing Quran applications.
Therefore, security and quality assurance standards are very important in designing software. The results
show that only 4.9% of the participants have witnessed digital Quranic apps with modified/forged content
due to unsecure coding or development, and in response to “Do you follow Quality Assurance standards
while developing such applications?” 35.6% say “Yes”, 40% say “No” and 23.7% say “I don’t know”. The
quality standards followed by the participants are given in Table 11. The responses to the “Other” option include
mathematical criteria, dedicated standards, and “I use the Quranic text that is known to have been validated
from Tanzil.org”. Also, 86.7% of the participants expressed their interest in applying security
standards, techniques and measures to prevent any type of forgery while developing Quran applications.
Also, 98.4% think it is necessary for a Quran authentication body to monitor and endorse the digital copies
of the Qur’an worldwide and 95.1% prefer validating Quranic resources through a Quran Authentication
Body.


TABLE 11 

THE LIST OF QUALITY STANDARDS USED BY PARTICIPANTS 


| Answer Options | Response Percent | Response Count |
|---|---|---|
| ISO 9004 | 51.9% | 14 |
| ISO 19011 | 3.7% | 1 |
| AS9100 | 0.0% | 0 |
| MBNQA | 3.7% | 1 |
| Lean/Six Sigma/TQM | 3.7% | 1 |
| McCall Quality Factors | 0.0% | 0 |
| Other (please specify) | 37.0% | 10 |


Finally, the participants provided their vision, comments and recommendations on how new technologies
could be used to develop and improve Quranic/related smartphone apps, web apps or user interaction.
Some of the participants' comments and recommendations are:

• Develop interactive apps. 

• Provide technical support for monitoring and addressing concerns from users. 

• Give more attention to information security and quality assurance. 

• Provide the Quran text as a service for developers in a way that they can only integrate it without 
having to modify. 

• Watermarking can be used to tackle the security issue. 

• Protect Quran applications by using information security algorithms. 

• Follow very rigorous information scrutiny and authenticity procedures for the content and
perform intrusion testing to make the application secure.

• Follow well-known and secure programming methodologies.

• Develop an intelligent Quran content checker.

• Quranic text and translations must be verifiable by the user of an app.

• An anti-tampering feature that checks the originality of the Quran should be included in such applications to
alert the users.

• Regulatory bodies (in collaboration with scholars) should be formed to monitor the development
and distribution of Quran apps and empowered to take action if tampering is
detected.

• Authentic digital Quranic text should be made available to Quran digital developers. It should be
in a form that is flexible enough to be used on all platforms.

• Using mobile solution networks, the authentic Quranic content can be disseminated among the
social users of the communities.


C. Graphic Designer 

The list of questions for the Graphic Designer category is provided in Table 12. This category received
only 4 responses even though 9 participants chose this category as the one they belong to. The results show
that the role of 50% of the participants involves designing graphics and/or user interfaces for Quranic applications.
However, none apply secure design concepts while designing applications for sensitive
scripts such as the Quran, and one respondent (25%) answered “never thought about this issue.” Also, 50% never
encountered a case in which the security of a digital Quran application was compromised due to
unsecure design, and the other 50% answered “never thought about this issue.” Following the above responses,
all participants agree on the following points:

• They will consider and apply design security standards, techniques and measures to prevent any 
type of forgeries/tampering in the design of the Quran applications. 

• It is necessary for a Quran authentication body to monitor and endorse the digital
copies/publications of the Quran worldwide.

• Prefer to validate Quran resources through a Quran authentication body. 

Finally, when the participants were asked to provide comments/suggestions/vision as graphics
designers, only one response was received. It suggested that social media can be linked to digital apps so
that people from all over the world can talk and discuss things about the
Quran, and that there should be an option to contact scholars within such applications.


TABLE 12 

THE “GRAPHIC DESIGNER” CATEGORY QUESTIONS 


No. | Question | Response Count 
1 | Does your role involve designing Graphics or User Interface for Quranic applications? | 4 
2 | Do you apply Secure Design concepts while designing Quranic application for sensitive Quranic Scripts? | 4 
3 | Which Secure Design Concept do you use? (If Q.2 is Yes)* | 3 
4 | Did you ever encounter any case in which the security of a Digital Quran application was compromised due to unsecure design?* | 4 
5 | Will you consider and apply Design Security Standards, Techniques and Measures to prevent any type of forgeries/tampering in the design of Quran Applications? | 3 
6 | Do you think it is necessary for a Quran Authentication Body to monitor and endorse the digital copies/publications of the Qur’an Worldwide? | 4 
7 | Do you prefer validating your Quranic resources through a Quran Authentication Body? | 4 
8 | What is your vision/comments/recommendations of how new technologies would be used to develop and improve Quranic/Related Smartphone Apps, Web Apps or user interaction further? | 1 


* Question has an “Other” option or “comment box” 


D. Organization Manager/Director/Decision Makers/Owners/CEOs 

The list of questions for the Organization Manager/Director/Decision Makers/Owners/CEOs category is 
provided in Table 13. The results show that approximately 60% of the participants are currently managers, 
directors, CEOs or decision makers of companies/organizations. Table 14 shows the current managerial 
role of the participants. The “Manager” option received the highest share of responses (32.1%), followed 
by the “Other” option (29.5%). The “Other” option included the following responses: deans of colleges, 
deputy deans, heads of department, supervisors, professors, project managers, IT technologists, lecturers, 
and salespersons. 64.9% of the respondents are involved in managing, directing and decision making in 
application development. 75% of the companies work with smart devices and digital application 
development, but only 15.3% of the companies develop and design digital Quran applications. 

The results show that 65.2% of the respondents obtain digital Quran content from free online resources, 
13% from Islamic bodies and 7.2% from Islamic institutes, while 14.5% are not sure. In addition, 20.6% 
stated, without providing any details, that they hold proof of authenticity for the Quran copies they 
develop; however, 79.4% do not have digital proof or have never thought about Quran authentication. In 
response to the question “Does your company follow any particular security standards while developing 
these applications?” 34.3% responded “Yes” and 65.7% responded “No” or “Never thought about 
using such standard.” Table 15 provides the security standards followed by these companies. The “Other” 
option included ISO 9001 + CMMI for DEV. L2, and SSL. 

TABLE 13 


THE “ORGANIZATION MANAGER/DIRECTOR/DECISION MAKERS/OWNERS/CEOS” CATEGORY QUESTIONS 


No. | Question | Response Count 
1 | Are you the current Manager/Director/CEO/Decision Maker/Owner of your company/Organization? | 76 
2 | What is your current Role?* | 61 
3 | Are you involved in Managing, Directing and Decision Making in Application Development? | 74 
4 | Does your company deal with Smart Devices/Digital Application Development? | 72 
5 | Does your company develop and design Digital Quran Applications? | 72 
6 | From where do you obtain the Digital Quran content? | 69 
7 | What Proof-of-Authenticity do you have that the Digital Quran Copy is/was authentic? | 68 
8 | Does your company follow any security standards while developing these applications? | 70 
9 | Which Security Standards does your company follow? (If the previous answer is YES)* | 56 
10 | Does your company follow Quality Assurance standards while developing such applications? | 67 
11 | Which Quality Standards does your company follow? (If the previous answer is YES)* | 50 
12 | Will you consider adopting Security Standards and Quality Assurance Standards while developing future digital applications? | 66 
13 | Do you think that it is necessary for a Quran Authentication Body to monitor and endorse the digital copies/publications of the Qur’an Worldwide? | 69 
14 | Will you prefer/recommend that your company will seek validating of Quranic resources through a Quran Authentication Body to obtain a Digital Quran Certificate? | 70 
15 | What is your vision/comments/recommendations of how new technologies would be used to develop and improve Quranic/Related Smartphone Apps, Web Apps or user interaction further? | 29 


* Question has an “Other” option or “comment box” 


TABLE 14 

THE CURRENT MANAGERIAL ROLE OF THE PARTICIPANTS 


Answer Options | Response Percent | Response Count 
Manager | 32.1% | 25 
Director | 21.8% | 17 
CEO | 9.0% | 7 
Decision Maker | 6.4% | 5 
Owner | 1.3% | 1 
Other (please state) | 29.5% | 23 
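
The percentages in Table 14 can be reproduced from the response counts. The short sketch below is only an illustrative check, not part of the survey tooling: it assumes each percentage is the option's count divided by the sum of all counts in the table (78) and rounded to one decimal place. The function name is hypothetical.

```python
# Response counts copied from Table 14.
counts = {
    "Manager": 25,
    "Director": 17,
    "CEO": 7,
    "Decision Maker": 5,
    "Owner": 1,
    "Other": 23,
}


def response_percentages(counts):
    """Return each option's share of all responses, rounded to one decimal."""
    total = sum(counts.values())
    return {option: round(100 * n / total, 1) for option, n in counts.items()}


percentages = response_percentages(counts)
for option, pct in percentages.items():
    print(f"{option}: {pct}%")
```

Under this assumption the computed shares agree with the Response Percent column of Table 14.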


TABLE 15 

THE SECURITY STANDARDS FOLLOWED BY THE COMPANIES OF THE PARTICIPANTS 


Answer Options | Response Percent | Response Count 
ISO 9001 | 17.9% | 10 
ISO 27001 | 3.6% | 2 
PCI-DSS | 0.0% | 0 
FFIEC | 0.0% | 0 
I don’t know | 71.4% | 40 
Other (please specify) | 7.1% | 4 


In response to the question “Does your company follow Quality Assurance standards while developing 
such applications?” 41.8% of the participants responded “Yes”, 25.4% responded “No” and 32.8% chose “I 
do not know”. Table 16 provides the Quality Assurance standards followed by these companies. The 
“Other” option includes: BRC & ISO 9001, company standards, and WISE (in house developed method). 
TABLE 16 

THE QUALITY ASSURANCE STANDARDS FOLLOWED BY THE COMPANIES OF THE PARTICIPANTS 


Answer Options | Response Percent | Response Count 
ISO 9004 | 14.0% | 7 
ISO 19011 | 8.0% | 4 
AS9100 | 0.0% | 0 
MBNQA | 0.0% | 0 
Lean/Six Sigma/TQM | 0.0% | 0 
McCall Quality Factors | 0.0% | 0 
I don’t know | 68.0% | 34 
Other (please specify) | 10.0% | 5 


In response to the question “Will you consider adopting Security Standards and Quality Assurance 
Standards while developing future digital applications?” 79.4% said “Yes”, 2.9% said “No” and 17.6% 
said “I don’t know”. An “I don’t know” response suggests that such participants are either not technically 
oriented or do not make decisions in the company. 

In response to the question “Do you think that it is necessary for a Quran Authentication Body to monitor 
and endorse the digital copies/publications of the Qur’an Worldwide?” 88.5% responded “Yes” with high 
preference, 10.1% said “Not sure” and 1.4% said “No”. In response to “Will you prefer/recommend that 
your company seek validating of Quranic resources through a Quran Authentication Body to obtain a 
Digital Quran Certificate?” 84.3% responded “Yes”, 12.9% “Not sure” and 2.9% “No”. From the results, 
it is very clear that many of the participants in this category are not familiar with security and Quality 
Assurance standards. 

Finally, the participants provided their vision/comments/recommendations on how new technologies 
could be used to develop and improve Quranic/related smartphone apps, web apps or user interaction. 
The following comments were recorded by the participants: 

• Multilanguage text search and translations. 

• Validation of Quran from original source. 

• Interactive voice commands. 

• Need to validate all Quranic information received from the Internet. 

• Use data and text mining techniques. 

• Use NLP techniques as a supporting method in understanding the Holy Quran. 

• Provide clear, concise and uniform information on Islamic and Quranic issues. 

• Authenticity is a must for Quranic applications. 

• Recommend a single Islamic institution to develop and implement steps of policies, strategies, 
standards, procedures, technologies, tools, applications and implementations on international level. 

• Support the development of an Islamic accreditation center which will have the authority to issue 
accreditation certificates for any Quranic/Islamic apps. 

• There is an urgent need for authentic Quranic apps for new Muslims especially in Europe. 

• Apply cloud computing and semantic web technologies as they are fertile fields for developing Arabic 
language and Quran apps, through research based on ontology relations and semantic fields in Arabic. 

• Develop a matrix of basic standards for quality control and an accreditation methodology for Holy 
Quran Information Systems (HQIS), based on "Quality Benchmarks" and "Accreditation" 
benchmarks. Develop a set of Key Performance Indicators (KPIs) comprising key factors and 
sub-functions. One systematic means of testing is to evaluate every basic stage of the Quran 
software, in all its details and components, in order to improve the software and integrate IT into the 
service of Quranic sciences so that these techniques become part of the HQIS. The product can then 
qualify for a Quality Mark, and the manufacturer can be granted an Accreditation Certification by a 
recognized donor. This matrix would serve as the basis for anyone who wishes to produce software 
tools for the science of the Quran that are compatible with science and Sharia, whatever the form, 
type or structure of the program. 

• To ensure the systematic dissemination of contemporary techniques used in the sciences of the Holy 
Quran, in all its branches and elements, it is necessary to establish a comprehensive, unified quality 
system that is accepted by all specialized centers for evaluating and approving the quality of software 
tools related to the Holy Quran. This requires combining the efforts of researchers and workers in the 
field of IT, specialists in forensic Quranic studies, and experts in international and software standards, 
in order to develop a matrix of basic criteria for quality control and an accreditation methodology for 
software tools related to Quranic science. KPIs are used as one systematic means to test, measure and 
evaluate all the basic phases of Holy Quran software in all its details and components. The 
performance indicators are therefore a set of key factors and sub-functions that ensure quality controls 
and an accreditation methodology for Quran science software tools, improving them and integrating 
IT into the service of Quranic sciences so that these techniques are closely linked. In evaluating the 
accuracy, excellence and efficiency of these software tools, the quality controls and the accreditation 
methodology for companies producing Quran-related software tools will help them achieve the 
required level of quality and meet the established standards regarding the organizational structure, 
procedures, processes and resources required to implement a comprehensive quality management 
system. 


E. Islamic Institutes/Islamic Research Centers/Religious Body/IT or Other Research Centers 

The list of questions for the “Islamic Institutes/Islamic Research Centers/Religious Body/IT or Other 
Research Centers” category is provided in Table 17. 

TABLE 17 


THE “ISLAMIC INSTITUTES/ISLAMIC RESEARCH CENTERS/RELIGIOUS BODY...” CATEGORY QUESTIONS 


No. | Question | Response Count 
1 | Are you a representative of your institute/organization/center? | 20 
2 | Does your curriculum cover Quranic studies at a Department or College level?* | 17 
3 | Does your institute have Scientific Research related to Quran? | 20 
4 | Does your institute have Research Centers dedicated for Quranic studies and research? | 20 
5 | Does your institute address Digital Authentication and Certification issues in Quranic research and studies? | 20 
6 | Do you think that the Digital Quran copies/publications available online or on different devices are all 100% authentic? | 19 
7 | Based on the Quranic research and academic activities conducted in your institute, do you think that it is necessary for a Quran Authentication Body to monitor and endorse the digital copies/publications of the Qur’an Worldwide? | 20 
8 | Will you prefer that your institute will become a part of such initiatives to secure and validate the digital Quranic resources by forming a Digital Quran Authentication Body and issue a Digital Quran Certificate that confirms the authenticity of the Quran contents? | 20 
9 | What is your vision/comments/recommendations of how new technologies would be used to develop Quranic/Related Smartphone Apps, Web Apps or user interaction further? | 10 


* Question has an “Other” option or “comment box” 


This category received 20 responses. 85% of the participants are representatives of their 
institutes/organizations, while 15% are representatives of non-Islamic institutes/organizations. 85% of 
these institutes/organizations provide a Quranic studies curriculum at a college or department level and 
15% at a high school or grade school level. Table 18 provides the responses to 5 of the questions shown 
in Table 17. The results show that 65% of these institutes conduct scientific research related to the Quran 
and 55% have research centers dedicated to Quranic studies. However, even though most of these 
institutes carry out research on such a sensitive script, authentication is not given high importance: only 
25% of them address the issue of digital authentication and certification. Only 15.8% of the respondents 
believe that the digital copies of the Quran available online or on different devices are 100% authentic. 
Accordingly, 80% of the participants would prefer that their institute become part of initiatives to secure 
and validate digital Quranic resources by forming a Digital Quran Authentication Body and issuing a 
Digital Quran Certificate that confirms the authenticity of the Quran contents. 


TABLE 18 

THE RESPONSE TO SOME OF THE QUESTIONS IN THIS CATEGORY 


Question | Yes | No | Not sure 
Does your institute have Scientific Research related to Quran? | 65% | 35% | 0% 
Does your institute have Research Centers dedicated for Quranic studies and research? | 55% | 45% | 0% 
Does your institute address Digital Authentication and Certification issues in Quranic research and studies? | 25% | 55% | 20% 
Do you think that the Digital Quran copies/publications available online or on different devices are all 100% authentic? | 15.8% | 21.1% | 63.% 
Will you prefer that your institute will become a part of such initiatives to secure and validate the digital Quranic resources by forming a Digital Quran Authentication Body and issue a Digital Quran Certificate that confirms the authenticity of the Quran contents? | 80% | 15% | 5% 


Finally, from the comments of the participants it can be concluded that: 

• Apps should provide certification and guidance to users. 

• Apps should be designed with age groups in mind. 

• Apps should be user-friendly. 

• Authentication is a must. 

• Develop authentication filters to check all information on the web. 

• ISO certification. 

• Use such applications in teaching the Holy Quran. 


F. Holy Quran Publishers 

The list of questions for the Holy Quran Publishers is given in Table 19. It shows that the maximum 
number of responses recorded for the Holy Quran Publishers category is 4. The results show that all 
participants agreed on the following points: 

• Their companies publish and distribute Quran related digital content or printed copies in addition to 
other non-Quranic publications. 

• Publishing the digital Quran through different media, such as smart devices and online, is an 
effective and easy way for users to read the Quran. 

• Current digital copies of the Quran are free from any intentional or unintentional errors. 

Other results from the responses received include: 

• 66.7% think that the digital copies of the Quran are all 100% authentic and that their organizations 
can rely on this authenticity; however, 33.3% are not sure. 

• The question “Will your publishing organization consider publishing digital copies of the Quran?” 
received three responses, each of them different: “Yes”, “No”, and “I do not know.” 

• Most of the participants (3 out of 4) are unaware of whether any authentication body is equipped and 
trained enough to conduct such authentication and validation of digital Quran 
resources/publications; only 1 respondent agreed that this is the case. 

• 75% of the 4 participants agree that it is a very high preference/requirement for digital copies of the 
Quran to be monitored and endorsed by a Quran authentication body, and 25% think otherwise. 

• 66.7% of the 3 respondents to this question would prefer/recommend that their publishing company 
consider applying to validate its Quranic resources through a Quran Authentication Body in order to 
obtain a Digital Quran Certificate that confirms the authenticity of the digital Quran contents. 


TABLE 19 

THE “HOLY QURAN PUBLISHERS” CATEGORY QUESTIONS. 


No. | Question | Response Count 
1 | Does your company deal with Publishing and Distributing the glorious Quran? (Digital or Printed) | 3 
2 | Does your company ONLY deal with Publishing and Distributing the printed and non-digital copies/publications of Quran? | 3 
3 | Is there any Authentication Body that deals in authenticating and validating the Quran before final publishing and further distributing?* | 3 
4 | Does your company consider that Digital Quran available in Smart Devices or Online provides another effective and easy method for users to read Quran? | 3 
5 | Do you think that the current digital copies of the Quran are free from any intentional or unintentional errors? | 3 
6 | Do you think that the digital copies of the Quran are all 100% authentic and your publishing organization can rely on its authenticity? | 3 
7 | Will your publishing organization consider publishing digital copies of the Quran? | 3 
8 | Do you think that any Authentication Body is equipped and trained enough to conduct such authentication and validation of digital Quran resources/publications? | 4 
9 | Do you think that it is necessary for a Quran Authentication Body to monitor and endorse the digital copies of the Qur’an Worldwide? | 4 
10 | Will you prefer/recommend that your publishing company will consider applying for validating their Quranic resources through a Quran Authentication Body to obtain a Digital Quran Certificate that confirms the authenticity of the digital Quran contents? | 3 
11 | What is your vision/comments/recommendations of how new technologies could be used to develop and improve Quranic/Related Smartphone Apps, Web Apps or user interaction further? | 1 


*Question has an “Other” option or “comment box” 


Finally, when asked for their comments and recommendations on how new technologies could be used to 
develop and improve Quran related smartphone apps or web apps, only one participant responded, 
suggesting the application of Artificial Intelligence for understanding the Quran. 

IV. Recommendations 

Based on the comments, suggestions and recommendations provided by the participants in different 
categories surveyed, the main recommendations can be summarized as follows: 

• Establish an international Islamic accreditation center to be the only source for publishing the printed 
and digital copies of the Quran. Its role will also include training, monitoring, endorsement, and 
scrutiny. 

• Support the development of an Islamic accreditation center which will have the authority to issue 
accreditation certificates for any Quranic/Islamic center so that they can act as certification bodies 
under the umbrella of the main Islamic accreditation center. 
• The international Islamic accreditation center must set a framework with clear milestones for 
developing standards and issuing certifications for centers as well as for the design of web and 
mobile Quran apps. 

• This organization is also responsible for providing certificates of authorization to establish centers in 
different countries that follow strict rules in publishing printed copies of the Quran and that issue 
certifications for apps. 

• Develop a matrix of basic standards for quality control and an accreditation methodology for Holy 
Quran Information Systems (HQIS), based on "Quality Benchmarks" and "Accreditation" 
benchmarks. 

• Develop a set of Key Performance Indicators (KPIs) comprising key factors and sub-functions. One 
systematic means of testing is to evaluate every basic stage of the Quran software, in all its details 
and components, in order to improve the software and integrate IT into the service of Quranic 
sciences so that these techniques become part of the HQIS. The product can then qualify for a 
Quality Mark, and the manufacturer can be granted an Accreditation Certification by a recognized 
donor. This matrix would serve as the basis for anyone who wishes to produce software tools for the 
science of the Quran that are compatible with science and Sharia, whatever the form, type or 
structure of the program. 

• Apply cloud computing and semantic web technologies as they are fertile fields for developing Arabic 
language and Quran apps, through research based on ontology relations and semantic fields in Arabic. 

• Develop a secure cloud-based repository for Quran related resources and make it available as a web 
service for all mobile and web apps. Any Quran related resource must first be authenticated before 
being used in designing apps. 

• Provide in-app capability for real-time authentication. 

• Develop an agent program that continuously monitors Quran text for correctness. Such text may be 
any Quran related upload on popular social media and other websites that may contain such 
content. 
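
One common way to realize the in-app authentication and text-monitoring recommendations above is to compare a cryptographic digest of the displayed text with a digest published by a trusted source. The sketch below is a minimal illustration under stated assumptions, not a method proposed by the survey participants: it assumes an authentication body publishes a SHA-256 digest of the reference text, the function names are hypothetical, and the strings are placeholders rather than actual Quranic text.

```python
import hashlib


def text_digest(text: str) -> str:
    """Return the SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_authentic(candidate: str, trusted_digest: str) -> bool:
    """True only if the candidate text matches the trusted digest exactly."""
    return text_digest(candidate) == trusted_digest


# Placeholder strings standing in for a reference text (assumption).
# \u064e is the Arabic diacritic fatha; \u064f is damma.
reference = "placeholder b\u064e t\u064f"
trusted = text_digest(reference)  # would be published by the trusted body

# Simulate tampering by removing a single diacritic.
tampered = reference.replace("\u064e", "")

print(is_authentic(reference, trusted))
print(is_authentic(tampered, trusted))
```

Because a digest changes when even one letter or diacritic is added or removed, such a check detects the small alterations the participants are concerned about. In practice the text would need a fixed Unicode encoding and normalization before hashing, since visually identical strings can differ at the byte level.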

In regard to features the participants would like to see in mobile and web apps, the following list 
provides some of the ideas presented by them: 

• Social media can be linked to digital apps to encourage communication, so that people from all over 
the world can talk about and discuss the Quran. 

• Option to contact scholars within the app. 

• Develop interactive apps. 

• Provide in-app technical support for monitoring and addressing concerns from users. 

• Develop an intelligent Quran content checker. 

• An anti-tampering feature should be included in such applications to alert the users. 

• Offline accessibility. 

• Provide free-of-charge and ad-free apps. 

• Bundle different features in one app, such as qibla direction and prayer times. 

• User-friendly voice-driven apps. 

• Develop apps targeting people with special needs. 

V. Conclusion 

This study presented the results of a questionnaire survey distributed to 500 participants to investigate the 
security, authentication, certification and standardization of Quran related applications. The survey 
captured the perspectives of six different categories of participants, and it is the first study of its kind in 
Quran computing to confront the issues of authentication, certification and standardization. The 
participants pointed out that a few incidents confirm the existence of errors in a few digital versions of the 
Holy Qur'an deployed via the Internet and/or mobile apps. Such intentional or unintentional forgery is not 
acceptable even if the error is as small as the removal or addition of one letter or diacritic, since it renders 
the app or digital content distorted and invalid. Therefore, it can be concluded that the main concern is not 
the development of Quran apps with attractive features or cosmetic user interfaces. Rather, the issue is the 
availability and standardization of properly and accurately digitized Quran related content/resources that 
are well authenticated (by scholars) and widely distributed (by recognized bodies). The security risks 
would then be greatly reduced and controlled. Once this stage is reached, it will be efficient and easy to 
develop sophisticated and advanced apps that can be easily monitored and certified by recognized bodies. 
This is the quality standard that primarily needs to be developed for Quran related apps, rather than any 
other standard that would check or improve the technical aspects of an app. Finally, security and 
authentication of the digital Quran increase users' confidence in Quran apps. 
Abstract — Cloud computing is one of the important resources 
in the IT industry, providing different services for cloud vendors 
and clients. Security is the main issue in the cloud because data 
are stored and maintained in a third-party environment. Cloud 
computing faces many issues in maintaining transactional data in 
cloud databases, since it must maintain ACID guarantees when 
executing transactions. In this paper, the proposed architecture, 
the Cloud Controller, the Depth 1 Fixed Tree Consistency 
(D1FTC) method, the Cloud Data Locker and data storage in the 
cloud are discussed. The paper also portrays the need for ACID 
guarantees and the major security levels, and ensures the 
consistency of data transactions. Finally, this paper offers a 
thorough study of the proposed architecture for improving the 
security and consistency of data transactions in cloud databases. 

Keywords — Cloud DTM, 2-Phase Commit Protocol, Cloud 
Security, Consistency in cloud DBS, Cloud Controller, D1FTC 
and Cloud Data Locker. 

I. Introduction 

Cloud computing provides everything as a service and has 
different deployment models. It gives reliable service for 
analytical data but not for transactional data management [32]. 
Most cloud services are deployed in a hybrid cloud 
environment. Cloud vendors make service agreements with 
cloud providers and launch reliable services for their clients. 
Instead of owning software and hardware, vendors move to the 
cloud environment, so they need not worry about software and 
hardware maintenance. The cloud has some technical issues in 
handling data in a distributed environment [10]. Cloud 
researchers are designing and developing new techniques: 
scalable transactions to avoid workload inconsistency [47, 49], 
concurrency controls to improve serialized access [4, 8, 22], 
consistency approaches for efficient transaction processing 
systems [24, 25, 26], scalable data storage models for better 
data handling [7, 12, 14], data security models [18, 19, 21] for 
efficient data handling, and security algorithms for multilevel 
security. Security is one of the most common and major 
concerns in the cloud, so multilevel security is needed to protect 
each service without loss of data [17, 33, 48]. The next major 
issue is maintaining a strongly consistent state at the database 
level, especially for transactional data management services 
[46]. Hence most research concentrates on developing reliable 
methodologies and architectures to strengthen the security of 
cloud services and the consistency of cloud distributed database 
systems [15, 16, 30]. 

II. REVIEW OF LITERATURE 

R. Anandhi et al. proposed a model to improve the 
consistency state of data transactions and also analyzed the 
performance factors of different scalability options in cloud 
databases. The paper insists that the reliability of cloud 
transaction applications depends on the consistency and 
scalability levels [13]. Chang Yao et al. proposed a concurrency 
control protocol named DGCC (Dependency Graph based 
Concurrency Control). This protocol is used to achieve better 
transaction execution and scalability among database systems. 
The DGCC-based OLTP system is also integrated with efficient 
recovery mechanisms [1]. Pompan Ampapom et al. carried out a 
performance analysis of two different data consistency models 
offered by leading cloud providers. The results show that write 
performance was 3 times worse than read performance and that 
it also has greater variance in consistency rate. Hence a better 
consistency approach is needed for data transactions in the 
cloud environment [3]. Aleksey Burdakov et al. proposed a 
consistency model for data transactions in NoSQL databases. 
It explores the characteristics of data consistency and analyzes 
the performance of different consistency models [6]. Alvaro 
Garcia-Recuero et al. proposed a consistency model to 
efficiently replicate data across long geographic distances in 
the cloud environment. This approach guards against 
overloading on both the network and the system side. The 
architecture is built with three-dimensional vector field models 
to handle different applications in the cloud [5]. Jens Kohler et 
al. proposed a data cache architecture that implements both 
parallel and lazy fetch strategies. This work compares the 
performance of the two strategies and discusses how to 
overcome the issues with the SeDiCo framework [7]. 
Thuy D. Nguyen et al. proposed a prototype called MLS 
column-store, following a kernelized design pattern. This 
approach is used in cloud-scale data storage systems. It explores 
the guarantees of efficient cloud-scale data storage in 
distributed systems [17]. Sebastijan Stoja et al. proposed an 
architecture for real-time databases in cloud data transactions. 
The paper explores other important topics and analyzes their 
merits and limitations [38]. Marco Serafini et al. proposed 
a dynamic data placement system for partitioned database 
management systems. This system provides ACID guarantees 
for data transactions. The analyzed results show the server 
capacity, which is used to improve placement quality [38]. 

III. NEED OF ACID 

Guaranteeing the ACID properties of data transactions is a 
global rule in the IT sector. Service providers design efficient 
architectures to satisfy this requirement [27]. In the cloud in 
particular, data are stored in a third-party environment 
maintained by cloud vendors and cloud providers [34]. Hence 
maintaining ACID in the cloud is not trivial [28]; implementing 
efficient data transaction services in the cloud is complex work. 
Many cloud providers offer data transaction services, but each 
has its own merits and limitations [34, 35]. 

A transaction is essentially a sequence of data reads and 
writes. 

TABLE I. ACID PROPERTIES 


Property | Description 
A | A transaction is executed completely, or it returns to the initial state. [All or Nothing] 
C | Maintain a consistent state in the database. 
I | A particular transaction will not affect other transactions. 
D | Ensure that a transaction is committed successfully. 


When a transaction commits successfully in the cloud 
environment, it should satisfy the ACID guarantees without 
loss of any details. The performance analyses depend on how 
ACID is maintained [36]. 
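
The all-or-nothing (atomicity) property can be illustrated with a small local example. The sketch below is not the architecture proposed in this paper; it uses Python's built-in `sqlite3` module and a hypothetical `account` table, and assumes a money transfer as the transaction. If the second update violates a constraint, the first update is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100)")
conn.execute("INSERT INTO account VALUES ('bob', 50)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM account"))


ok = transfer(conn, "alice", "bob", 30)       # succeeds
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
print(ok, failed, balances(conn))
```

In the failed transfer the credit to the destination account is applied first and then undone by the rollback, so the database returns to its state before the transaction, as the "A" row of Table I describes.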

IV. EXISTING CONSISTENCY APPROACHES 

The consistency approaches most frequently used in 
distributed cloud environments are the Classic approach, the 
Quorum approach and the Tree-Based approach. These three 
consistency approaches are discussed below, along with their 
merits and limitations. 

a) Classic Approach: 

The classic approach maintains a consistent state 
through synchronous replication in the distributed environment. 
In this approach all nodes or database servers participate in 
every write operation [2]. Hence, it has a low consistency rate 
and a long execution time for each transaction in the cloud 
environment [9, 23]. 

b) Quorum Approach: 

The Quorum approach is stronger than the classic 
approach and is frequently used in the cloud to replicate cloud 
databases. The so-called quorum method of voting is used for 
replication among the cluster of servers in the cloud environment. 
In this approach, a majority of votes from all participating 
nodes or servers must be confirmed before execution proceeds. 
It performs better and gives higher consistency assurance than 
the classic approach, but it slows down database execution 
because of the frequent voting [9, 23]. 
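
The majority rule described above can be illustrated with a minimal Python sketch. The function names `majority_quorum` and `quorum_reached` are illustrative and do not come from the cited works; the sketch assumes one vote per replica and a simple majority threshold.

```python
def majority_quorum(total_nodes):
    """Smallest number of votes that forms a strict majority."""
    if total_nodes < 1:
        raise ValueError("at least one node is required")
    return total_nodes // 2 + 1


def quorum_reached(votes):
    """Return True when a majority of the replicas voted yes.

    `votes` maps a replica name to its vote (True = accept the write).
    """
    yes_votes = sum(1 for vote in votes.values() if vote)
    return yes_votes >= majority_quorum(len(votes))


if __name__ == "__main__":
    # Three of five replicas accept, so the write may proceed.
    votes = {"vm1": True, "vm2": True, "vm3": True, "vm4": False, "vm5": False}
    print(quorum_reached(votes))
```

Every write needs a round of votes, which is the source of the slowdown noted above.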


c) Tree-Based Approach: 

The Tree-Based consistency approach is the leading 
approach for maintaining data consistency in distributed 
databases. It is formulated on a complex tree structure. Hence 
it provides varying degrees of consistency, depending on the 
level of the tree at which a replica server is placed [2]. This 
approach introduces components in the cloud environment for 
better execution of transactions [9]. Replica nodes at the 
highest level of the tree provide high consistency assurance, 
and the assurance decreases toward the replica nodes at the 
lowest level of the tree [9, 23]. 

V. SECURITY AT DIFFERENT LEVELS 

Multi-level security is essential for a distributed database 
system. Every level of protection is important to avoid data 
loss and to provide an efficient data service. Greater care is 
needed to provide data services in the cloud because data are 
maintained in a third-party environment [37, 45]. Proper 
security models may therefore strengthen cloud services [20, 31]. 

Cloud services need security at the following levels: 


TABLE II. SECURITY AT DIFFERENT LEVELS 

Level of Security               Description 

Server access security          Ensuring access control 
                                (Authentication, Authorization, 
                                and Auditing) for services in the 
                                cloud environment. 
Internet access security        Managing connectivity and open 
                                access in the public cloud. 
Infrastructure security at      Ensuring availability of the 
the network level               Internet-facing resources of the 
                                public cloud used by the 
                                organization. 
Database access security        Ensuring access control for the 
                                database and key management for 
                                encryption. 
Data privacy security           Ensuring confidentiality and 
                                integrity of the organization’s 
                                data in transit to and from the 
                                public cloud provider. 
Program access security         Ensuring access control for the 
                                programs of the client’s 
                                applications in the public cloud. 


VI. DATA TRANSACTION IN CLOUD 

a) ElasTraS: An Elastic Transactional Data Store in the Cloud: 

ElasTraS is designed for scalable and elastic data 
transactions in cloud databases [39]. It adds components to 
achieve elasticity in data storage during data transactions. 
It uses a two-level hierarchy to maintain transaction guarantees 
and to scale elastically as the workload increases. 
ElasTraS overcomes the limitations of DDBMS with its 
database techniques for isolation and concurrency control. 

b) G-Store: A Scalable Data Store for Transactional Multi-Key 
Access in the Cloud: 

G-Store is a scalable data store with multi-key access in the 
cloud environment [40]. It is designed to achieve scalability, 
availability and fault tolerance. The key group abstraction 
allows any set of keys in the data store to be selected as a 
group and provides scalable transactions over it. The atomicity 
and consistency guarantees are extended from a single key to 
a group of keys. 

c) Scalable Transactions for Web Applications in the Cloud: 

This approach supports scalable transactions in the cloud 
environment [29]. It has a transaction manager and a 
number of local transaction managers to handle the 
transactions. It maintains the ACID properties even in the 
presence of server failures. The local transaction managers 
replicate the data and periodically checkpoint data snapshots 
to a cloud data storage service [30]. 

d) EcStore: Towards Elastic Transactional Cloud Storage with 
Range Query support: 

EcStore is deployed across the cloud cluster to provide 
high elasticity with efficient range queries in support of cloud 
data transactions [41]. It offers features such as load balancing, 
data partitioning, data replication and efficient range queries 
for each transactional access. A distributed storage layer, a 
replication layer and a transaction manager layer 
handle the data in the cloud storage system. 

e) Dynamo: Amazon's Highly Available Key-Value Store: 

Dynamo was built for Amazon to achieve high availability and 
scalability among cloud clusters [42]. It aims to 
satisfy high availability, consistency, performance and 
cost-effectiveness. In this approach data are partitioned and 
replicated with consistent hashing, and consistency is maintained 
by object versioning. A quorum-like technique is used to 
maintain a consistent state among replicas during 
updates, together with a decentralized replica synchronization 
protocol. 
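
To make the partitioning idea concrete, the following Python sketch shows a plain consistent hashing ring. It is only an illustration under stated assumptions: the class name `HashRing`, the use of MD5 as the ring hash and the default of three replicas are choices made here, and the sketch omits virtual nodes, object versioning and the other mechanisms of the real system.

```python
import bisect
import hashlib


def ring_hash(key):
    """Map a string onto the ring (MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent hashing ring with one position per node."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Sort the nodes by their position on the ring.
        self._ring = sorted((ring_hash(node), node) for node in nodes)
        self._points = [point for point, _ in self._ring]

    def preference_list(self, key):
        """Nodes that store `key`: the first node clockwise from the
        key's position, followed by its successors on the ring."""
        size = len(self._ring)
        start = bisect.bisect_right(self._points, ring_hash(key)) % size
        count = min(self.replicas, size)
        return [self._ring[(start + i) % size][1] for i in range(count)]


if __name__ == "__main__":
    ring = HashRing(["vm1", "vm2", "vm3", "vm4"], replicas=3)
    print(ring.preference_list("order:42"))
```

Because each key is placed relative to node positions, adding or removing one node only moves the keys adjacent to it on the ring.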

f) Megastore: Providing Scalable, Highly Available Storage for 
Interactive Services: 

Megastore is a highly available storage system for 
interactive services in the cloud [43]. Most NoSQL storage 
systems, such as Google’s Bigtable and HBase, are highly 
scalable but rely on a loose consistency model. Megastore 
satisfies the ACID properties over remote 
replicas with low latency for interactive applications. 

g) Sinfonia: A New Paradigm for Building Scalable Distributed 
Systems: 

Sinfonia provides efficient and consistent access to data 
through mini-transactions, which hide the problems that arise 
from concurrency and transaction failures [44]. It avoids 
message-passing protocols to reduce the complexity of the 
development process. Developers use Sinfonia to build data 
centre infrastructure such as file systems, lock managers and 
communication services. 

VII. PROPOSED ARCHITECTURE 

The proposed architecture shown in the figure improves 
security and consistency for data transactions in the cloud 
environment. The components of this architecture were 
identified after analysing related work in the field. 
Further execution details are explored thoroughly in this 
section. The main components are listed below. 

a) User Authentication 

b) Cloud Service Controller 

c) Depth 1 Fixed Tree Consistency (D1FTC) Approach 

d) Cloud Data Locker 

a) User Authentication: 

The client sends the transaction request, with a username 
and password, to the authentication server through the internet. 
The authentication server acknowledges the request and sends 
the transaction ID with a PIN template to the client. The client 
sends the PIN to the authentication server for verification. The 
authentication server verifies whether it is the correct PIN for 
the specified account. After successful verification, 
the request may enter the service level. 
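
The two-step exchange described above can be sketched in Python as follows. This is a hypothetical illustration, not the authors' implementation: the class `AuthServer`, its method names and the salted PBKDF2 storage of the password and PIN are assumptions introduced here, and the PIN template is not modelled.

```python
import hashlib
import hmac
import secrets


def hash_secret(secret, salt):
    """Derive a salted hash so that plain secrets are never stored."""
    return hashlib.pbkdf2_hmac("sha256", secret.encode("utf-8"), salt, 100_000)


class AuthServer:
    """Hypothetical authentication server with two verification steps."""

    def __init__(self):
        self._accounts = {}  # username -> (salt, password hash, PIN hash)
        self._pending = {}   # transaction ID -> username awaiting a PIN

    def register(self, username, password, pin):
        salt = secrets.token_bytes(16)
        self._accounts[username] = (
            salt, hash_secret(password, salt), hash_secret(pin, salt))

    def request_transaction(self, username, password):
        """Step 1: check the credentials; return a transaction ID or None."""
        account = self._accounts.get(username)
        if account is None:
            return None
        salt, password_hash, _ = account
        if not hmac.compare_digest(password_hash, hash_secret(password, salt)):
            return None
        transaction_id = secrets.token_hex(8)
        self._pending[transaction_id] = username
        return transaction_id

    def verify_pin(self, transaction_id, pin):
        """Step 2: check the PIN for the account tied to the transaction ID.

        The transaction ID is consumed, so it cannot be reused.
        """
        username = self._pending.pop(transaction_id, None)
        if username is None:
            return False
        salt, _, pin_hash = self._accounts[username]
        return hmac.compare_digest(pin_hash, hash_secret(pin, salt))
```

Only a request that passes both steps would be forwarded to the service level.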

b) Cloud Service Controller 

The framework of the cloud service controller 
clarifies the functions it provides to enhance service 
authentication. After user authentication, the client’s 
requests enter the cloud service controller. It has two main 
components: service policy verification and the cloud 
virtual machine monitor. 



Fig 2. Cloud Service Controller 
b.1) Service Policy Verification: 

The service policy verification verifies the SLA between cloud 
providers and cloud vendors. It ensures that the Measurement, 
Condition Evaluation and Management services are not 
violated. After the service policy verification, the request 
enters the virtual machine monitor. 

b.2) Cloud Virtual Machine Monitor: 

It has two different tools, the VM Introspection tool 
and the VMM Profile tool. These tools are mainly used to 
verify the status of each virtual machine involved in data 
transaction management. 

b.2.1) VM Introspection Tool: 

A number of virtual machines are created and involved 
in the execution of data transactions. The virtual machine 
introspection tool is used specifically to evaluate the virtual 
machines. It frequently checks the virtual machines and 
ensures that they function correctly. This 
approach is very useful for maintaining the virtual machines 
so as to provide a reliable service to data transaction clients. 

b.2.2) VMM Profile Tool: 

The Virtual Machine Manager (VMM) contains some of 
the specifications of the virtual machines. It has virtual 
machine profiles that simplify the creation of templates. 
Templates are used to create virtual machines quickly with 
proper hardware and operating system settings. The VMM 
offers different types of profiles, such as the hardware profile, 
operating system profile and virtual hard disk profile. These 
profiles are used to create virtual machines from the 
templates. 

c) Depth 1 Fixed Tree Consistency (D1FTC) Approach: 

The proposed D1FTC method is especially suited to data 
transactions in the cloud. It efficiently supports the two-phase 
commit protocol to execute each transaction without affecting 
the ACID properties, even in critical situations. The preliminary 
setup and methodology of the proposed D1FTC method are 
explained in detail as follows. 

c.1) Functional Procedures: 

Preliminary setup: 

Step 1: Generate an undirected graph for the available virtual 
machines (nodes) 

Fig 3. Undirected graph for cloud virtual machines with distances 

Step 2: Create possible Depth 1 Fixed Trees (D1FTC) 

Create an adjacency list for the undirected graph and refer to 
it to find the possible depth 1 fixed trees. The nodes of the 
undirected graph represent the cloud virtual machines. 
The adjacency list for the graph is as follows: 

Node    Possible Linked Nodes 
1       2, 3 
2       1, 3, 4 
3       1, 2, 4, 5 
4       2, 3, 5 
5       3, 4 

Fig. 4. Adjacency List 

The adjacency list specifies the relationship between each 
node and the other nodes in the graph. Hence it is very useful 
for creating the possible depth 1 fixed trees for the graph: 
each node in turn serves as the root, and its linked nodes 
become its children. 
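
A minimal Python sketch of this step is given here, using the adjacency list of Fig. 4. The names `Depth1Tree` and `depth1_fixed_trees` are illustrative assumptions; the sketch treats every node as a possible root and its linked nodes as the children, and it also checks that the list really describes an undirected graph.

```python
from dataclasses import dataclass

# Adjacency list of the example graph (node -> linked nodes).
ADJACENCY_LIST = {
    1: [2, 3],
    2: [1, 3, 4],
    3: [1, 2, 4, 5],
    4: [2, 3, 5],
    5: [3, 4],
}


@dataclass(frozen=True)
class Depth1Tree:
    root: int        # candidate transaction coordinator
    children: tuple  # candidate transaction participants


def depth1_fixed_trees(adjacency):
    """Build one depth 1 tree per node: the node is the root and its
    linked nodes are the children."""
    for node, linked in adjacency.items():
        for other in linked:
            # In an undirected graph every link must appear in both lists.
            if node not in adjacency.get(other, ()):
                raise ValueError(f"link {node}-{other} is not symmetric")
    return [Depth1Tree(root, tuple(sorted(linked)))
            for root, linked in sorted(adjacency.items())]


if __name__ == "__main__":
    for tree in depth1_fixed_trees(ADJACENCY_LIST):
        print(tree.root, tree.children)
```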

Step 3: Find the distance between the nodes in each D1FTC 

Create an adjacency matrix for the undirected graph and refer to 
it to find the distances between the nodes in each tree and to 
count the nodes in each tree. The adjacency matrix for the 
above-mentioned undirected graph is as follows: 



Node    1    2    3    4    5    No. of nodes    Distance between nodes 
1       ∞    3   10    ∞    ∞         3                  13 
2       3    ∞    2    8    ∞         4                  13 
3      10    2    ∞    5    4         5                  21 
4       ∞    8    5    ∞    7         4                  20 
5       ∞    ∞    4    7    ∞         3                  11 

Fig. 6. Adjacency Matrix 
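
The two summary columns of the matrix can be reproduced with a short Python sketch. The edge weights are those read from the adjacency matrix; the function name `tree_summary` and the use of infinity for "no direct link" are assumptions of this sketch.

```python
import math

INF = math.inf  # marks the absence of a direct link

# Rows and columns correspond to nodes 1..5 of the example graph.
ADJACENCY_MATRIX = [
    [INF, 3, 10, INF, INF],
    [3, INF, 2, 8, INF],
    [10, 2, INF, 5, 4],
    [INF, 8, 5, INF, 7],
    [INF, INF, 4, 7, INF],
]


def tree_summary(matrix):
    """For the depth 1 tree rooted at each node, return
    (number of nodes in the tree, total distance from root to children)."""
    summary = {}
    for index, row in enumerate(matrix):
        weights = [weight for weight in row if weight != INF]
        # The tree holds the root plus one child per finite entry.
        summary[index + 1] = (len(weights) + 1, sum(weights))
    return summary


if __name__ == "__main__":
    for root, (count, distance) in tree_summary(ADJACENCY_MATRIX).items():
        print(root, count, distance)
```

With these weights the sketch yields the node counts 3, 4, 5, 4, 3 and the distances 13, 13, 21, 20, 11 for the trees rooted at nodes 1 to 5.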



Step 4: Find the shortest path from all nodes 

The objective is to find the shortest path from each node to all 
other nodes in the graph. Dijkstra’s algorithm is a suitable 
way to find the shortest path from one node to the other nodes in 
the graph. 

In Fig (a), the aim is to find the shortest paths from node 1. Edge 
values in the graph are weights and node values in the tree are 
total weights. The transaction manager fixes the nearest node as 
the starting position from which to find the shortest paths that 
connect all nodes. Fig (b) shows the shortest paths from node 1 and 
Fig (c) shows the fixed tree created from the shortest paths. The 
tree is used to update the data among all cloud virtual machines 
after the successful commit of a cloud transaction. So the 
transaction manager replicates the data after a successful commit 
along the shortest path tree. 
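
As a sketch of this step, Dijkstra's algorithm can be written in Python with a binary heap. The edge weights are those read from the adjacency matrix of the example graph; the function name `shortest_path_tree` and the dictionary representation are assumptions made for illustration.

```python
import heapq

# Weighted undirected example graph (node -> {neighbour: weight}).
GRAPH = {
    1: {2: 3, 3: 10},
    2: {1: 3, 3: 2, 4: 8},
    3: {1: 10, 2: 2, 4: 5, 5: 4},
    4: {2: 8, 3: 5, 5: 7},
    5: {3: 4, 4: 7},
}


def shortest_path_tree(graph, source):
    """Dijkstra's algorithm.

    Returns (dist, parent): the total weight of the shortest path to
    every reachable node, and each node's predecessor on that path.
    The parent links form the shortest path tree rooted at `source`.
    """
    dist = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    settled = set()
    while heap:
        distance, node = heapq.heappop(heap)
        if node in settled:
            continue  # stale heap entry
        settled.add(node)
        for neighbour, weight in graph[node].items():
            candidate = distance + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                parent[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    return dist, parent


if __name__ == "__main__":
    dist, parent = shortest_path_tree(GRAPH, 1)
    print(dist)
    print(parent)
```

With these weights, the total weights from node 1 are 3, 5, 10 and 9 for nodes 2, 3, 4 and 5, and the tree links are 1-2, 2-3, 3-4 and 3-5.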



D1FTC Method: 

The D1FTC method is ready for execution once the 
preliminary work is complete. It handles the virtual machines 
efficiently to commit each transaction successfully. 

The components of D1FTC are as follows: 

1. User 

2. Transaction Manager 

3. Fixed Trees 

4. Transaction Coordinator 

5. Transaction Participants 

6. Two-Phase commit protocol 

7. Shortest path Trees 

a) User: 

The user submits a request to the transaction manager and 
receives the response without loss of details. 

b) Transaction Manager: 

The transaction manager manages all transaction requests 
from different users. It fixes a node for a read query and fixes 
a tree for update operations. It analyzes the query and fixes a 
suitable tree with a transaction coordinator and participants. 

c) Fixed Trees: 

The structure of these fixed trees is well suited to 
implementing the two-phase commit protocol, because it has a 
one-to-many relationship: one transaction coordinator and 
many transaction participants for every update transaction. A 
participant does not affect the other participants and 
communicates only with the coordinator. So the transaction 
manager can choose any node as the transaction 
coordinator, and the nodes linked under the transaction 
coordinator are chosen as transaction participants. 

c.1) Adjacency Matrix: 

The transaction manager fixes one of the trees depending 
on the number of nodes needed to execute a transaction. The 
adjacency matrix therefore records the number of nodes and the 
distance between nodes in each tree, which simplifies the work 
of the transaction manager. 
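
One possible selection rule is sketched below in Python. The paper does not state how ties or surplus nodes are handled, so the rule used here (take the trees with at least the required number of nodes and prefer the smallest total distance) is an assumption, as is the function name `select_fixed_tree`. The summary values are the two rightmost columns of the adjacency matrix.

```python
# root -> (number of nodes in the tree, total distance between nodes)
TREE_SUMMARY = {1: (3, 13), 2: (4, 13), 3: (5, 21), 4: (4, 20), 5: (3, 11)}


def select_fixed_tree(summary, required_nodes):
    """Return the root of the tree chosen for a transaction.

    Assumed rule: among the trees that offer at least `required_nodes`
    nodes, pick the one with the smallest total distance (ties are
    broken by the lower root number).
    """
    candidates = [
        (distance, root)
        for root, (count, distance) in summary.items()
        if count >= required_nodes
    ]
    if not candidates:
        raise ValueError("no fixed tree has enough nodes")
    return min(candidates)[1]


if __name__ == "__main__":
    print(select_fixed_tree(TREE_SUMMARY, 4))
```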

d) Transaction Coordinator: 

The transaction coordinator is responsible for the given 
transaction, and it manages all the participants in the selected 
fixed tree to commit the transaction. The two-phase commit 
protocol is implemented in the transaction coordinator. 

e) Transaction Participants: 

The transaction is divided into a small number of processes, 
which are sent to the transaction participants in the fixed tree. 
All participants are supervised by the transaction coordinator. 

f) Two-Phase commit protocol: 

The two-phase commit protocol is an efficient way to 
execute data transactions in a distributed system. It helps 
to execute transactions successfully with ACID 
guarantees in the cloud environment. It is also well suited to 
the proposed D1FTC method. 
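
The protocol can be illustrated with a minimal in-memory Python sketch. It is not the authors' implementation: the class `Participant`, the function `two_phase_commit` and the state names are assumptions, and real deployments also need logging, timeouts and recovery, which are omitted here.

```python
class Participant:
    """A transaction participant (a child node of the fixed tree)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a local failure when False
        self.work = None
        self.state = "INIT"

    def prepare(self, work):
        """Phase 1: receive the work item and vote yes or no."""
        self.work = work
        self.state = "PREPARED" if self.can_commit else "ABORTED"
        return self.can_commit

    def commit(self):
        """Phase 2, success path."""
        self.state = "COMMITTED"

    def rollback(self):
        """Phase 2, failure path: return to the initial data state."""
        self.state = "ABORTED"


def two_phase_commit(participants, work_items):
    """Run by the coordinator (the root of the fixed tree).

    Expects one work item per participant. The transaction commits
    only if every participant votes yes; otherwise all roll back.
    """
    votes = [p.prepare(w) for p, w in zip(participants, work_items)]
    if all(votes):
        for participant in participants:
            participant.commit()
        return True
    for participant in participants:
        participant.rollback()
    return False


if __name__ == "__main__":
    nodes = [Participant("vm1"), Participant("vm3"), Participant("vm4")]
    print(two_phase_commit(nodes, ["part-a", "part-b", "part-c"]))
```

Because participants talk only to the coordinator, a depth 1 tree maps directly onto the two message rounds of the protocol.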

g) Shortest path Trees: 

In the cloud, data are replicated over large geographic distances 
for every transaction, and data may be lost during the replication 
process. The purpose of finding the shortest path is to connect 
all virtual machines, so as to avoid an inconsistent database and 
minimize the replication time in the cloud environment. 

c.2) Framework: 

The framework of the D1FTC approach is shown in detail 
below. The step-by-step execution is clearly 
defined; it executes the data transaction and 
replicates the data in the distributed cloud database system. 

Fig. 8. Framework for D1FTC 

c.3) Pseudocode for D1FTC Approach: 

Step 1: The user request is sent to the transaction manager 
Step 2: If (Read operation) 
        { 
            Allocate any node to get the updated data. 
        } 
        Else 
Step 2.1: Refer to the adjacency matrix 
Step 3: Select the Depth 1 Fixed Tree [according to the 
        transaction query] 
Step 3.1: Fix the root node as transaction coordinator (TC) 
Step 3.2: Fix the child nodes as transaction participants (TP) 
Step 4: Implement the two-phase commit protocol in the TC. 
        The TC divides the transaction and sends it to the 
        transaction participants (TP) 
        The TC executes the transaction 
Step 5: Refer to Dijkstra’s algorithm 
Step 6: Select the shortest path tree and replicate/update the 
        data. 
Step 7: Successfully commit the transaction 


d) Cloud Data Locker 

In the cloud, data are stored in a third-party environment and 
accessed by remote clients. The cloud providers offer database 
as a service with their own functionalities. The cloud vendors 
purchase the infrastructure, platform, software and database 
and launch services for their clients. Hence maintaining 
security at the data storage level is not easy for cloud data 
transactions. The proposed cloud data locker model has 
three-stage verification. 

d.1) Framework: 

In the first stage, the Cloud Data Locker (CDL) model 
verifies the user by sending and receiving an OTP through the 
internet. In the second stage, it accesses the Data Security 
Server and verifies the data storage (provider) by sending and 
receiving the Data Security Number (DSN). The third stage is 
enabled after the successful verification of both stage one and 
stage two. It accesses the Crypto Engine to handle the encrypted 
data stored in cloud data storage. The crypto engine decrypts the 
data for execution and encrypts the data after execution to store 
them in cloud data storage. So the provider does not know which 
data are stored by the clients, and the user and vendors are 
verified with the OTP and DSN systems. Hence this three-stage 
verification ensures a high level of data security for the cloud 
environment. 



d.2) Pseudocode for cloud data locker: 

Step 1: Send OTP to user 
Step 1.1: Get OTP from user 
Step 1.2: Verify the user 
Step 2: Send Data Security Number (DSN) to cloud data 
        storage provider 
Step 2.1: Data Security Server matches the DSN in the data 
        storage provider 
Step 2.2: Verify the cloud data storage provider 
Step 3: If (User & Data storage == Verified) 
Step 3.1: Access the Crypto Engine and decrypt the data 
Step 3.2: Send data to the transaction process 
Step 3.3: Get the committed data 
Step 3.4: Access the Crypto Engine and encrypt the data 
Step 3.5: Store the data successfully. 
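
The three stages can be sketched in Python as follows. This is a hypothetical outline rather than the authors' implementation: the function names, the six-digit code format and the way the crypto engine is passed in as two callables are all assumptions. The byte-reversal codec in the usage example is only a stand-in for the crypto engine and provides no security.

```python
import hmac
import secrets


def issue_code(digits=6):
    """Generate a random numeric code (used here for both OTP and DSN)."""
    return "".join(secrets.choice("0123456789") for _ in range(digits))


def codes_match(expected, received):
    """Constant-time comparison of the issued and the returned code."""
    return hmac.compare_digest(expected.encode(), received.encode())


def cloud_data_locker(stored, otp, otp_reply, dsn, dsn_reply,
                      decrypt, encrypt, transaction):
    """Run the three verification stages and return the re-encrypted data.

    `decrypt` and `encrypt` stand for the crypto engine; `transaction`
    is the processing applied to the decrypted data.
    """
    # Stage 1: verify the user with the OTP.
    if not codes_match(otp, otp_reply):
        raise PermissionError("user verification failed")
    # Stage 2: verify the data storage provider with the DSN.
    if not codes_match(dsn, dsn_reply):
        raise PermissionError("storage provider verification failed")
    # Stage 3: decrypt, run the transaction, encrypt the committed data.
    committed = transaction(decrypt(stored))
    return encrypt(committed)


if __name__ == "__main__":
    # Placeholder codec: NOT encryption, only a reversible stand-in.
    encrypt = decrypt = lambda data: data[::-1]
    otp, dsn = issue_code(), issue_code()
    stored = encrypt(b"balance=100")
    result = cloud_data_locker(
        stored, otp, otp, dsn, dsn, decrypt, encrypt,
        lambda data: data.replace(b"100", b"90"))
    print(decrypt(result))
```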

VIII. FEATURES OF PROPOSED ARCHITECTURE 

a) Security for all levels 

The proposed architecture provides multilevel security for 
every data transaction. The user authentication server, cloud 
service controller and cloud data locker are constructed to 
strengthen security at the user, service and data storage levels. 

b) Ensure Service Level Agreement: 

The SLA between cloud providers and vendors is verified 
continuously to offer a reliable service to the clients of the 
vendors. The service policy verification makes sure the SLA is 
up to date. 

c) Efficient Consistency Method: 

D1FTC is the proposed consistency approach designed for 
data transaction management in the cloud. The adjacency list and 
adjacency matrix are used to construct the depth 1 fixed trees 
and to measure the number of nodes and the distance between 
nodes in each fixed tree. After every commit, data are replicated 
through the shortest path tree found with Dijkstra’s algorithm. 

d) Consistent structure for 2-phase commit protocol: 

The depth 1 fixed trees are structured specifically to 
implement the two-phase commit protocol. Every fixed tree has 
one transaction coordinator and more than one immediate 
transaction participant. Hence the D1FTC structure is well 
suited to the two-phase commit protocol. 

e) Enhanced Data Security: 

The proposed Cloud Data Locker model has a three-stage 
verification process to enhance data storage security in the 
cloud. It verifies the user with the OTP system, 
verifies transaction services with the data security server and 
accesses the data through the crypto engine. The data are stored 
in encrypted form in the cloud. The cloud providers and vendors 
cannot access or operate on the client’s data. 

f) Guarantee Access Control: 

The proposed architecture ensures access control at all 
levels. It relies on the user authentication 
server, cloud service controller and cloud data locker to 
manage access control at the user, service and data storage levels. 

g) Ensure ACID guarantees: 

The proposed D1FTC approach maintains a consistent 
state for data transactions in the cloud. The two-phase commit 
protocol is implemented in the fixed trees of the D1FTC approach 
to maintain the Atomicity and Isolation properties. After a 
data transaction is executed, data are updated / replicated 
through the shortest path tree to maintain the Durability of 
every data transaction. So the proposed architecture ensures 
the ACID guarantees for data transactions in the cloud 
environment. 

h) Efficient data replication: 

After every transaction is executed, the data should be updated / 
replicated on all servers in the distributed cloud database. The 
proposed D1FTC uses Dijkstra’s algorithm to find the 
shortest path tree from each node to all other nodes in the cloud. 
The transaction can choose a reliable shortest path tree to 
update / replicate the data efficiently within a short span of 
time compared with other approaches. 

i) Trusted Transactions: 

The cloud service providers (vendors) and end users expect 
assurance for every transaction. The proposed D1FTC 
approach has a transaction manager to track each transaction from 
its beginning to the commit state. The transaction coordinators 
in the fixed trees are responsible for their allotted transactions. 
If the system faces any failure, it returns to the initial state 
without loss of data. 

j) Easy implementation: 

The implementation of the proposed architecture is simple in 
the cloud environment. In the proposed D1FTC approach, the 
virtual machines are fixed for transaction processing with 
reference to the adjacency list, adjacency matrix and Dijkstra’s 
algorithm. The proposed cloud service controller has 
the VM introspection and VMM profile tools to monitor the 
state of the virtual machines and maintain them. The proposed 
cloud data locker has a simple three-stage methodology to 
provide a high level of data security in cloud storage. Hence 
the proposed architecture is easy to implement and reliable 
for the cloud environment. 

k) Minimize the response time: 

The proposed D1FTC approach is designed to implement the 
two-phase commit protocol so as to execute transactions faster 
than other approaches. It allocates the fixed trees depending on 
the weight of the transactions. It uses the shortest path trees to 
replicate / update the data among the cloud virtual machines. So 
this methodology may minimize the response time for 
data transactions in the cloud. 

IX. CONCLUSION AND FUTURE WORK 

The cloud provides reliable services for handling analytical data, 
but it faces security and consistency issues when offering 
transactional data management for applications such as banking, 
online reservation and shopping carts. In this paper, the proposed 
architecture ensures security at the user, service and data 
storage levels with an efficient cloud service controller and 
cloud data locker model. The proposed D1FTC approach is designed 
to minimize the execution time among the virtual machines 
scattered across the distributed cloud environment. Hence the 
proposed architecture is designed to ensure the service level 
agreement, efficient transaction processing, the ACID 
properties and enhanced data storage security for trusted data 
transactions in the cloud environment. Future work is to 
implement the proposed architecture in a real-time cloud 
environment and to analyze its performance against existing 
techniques. 

Color Histogram with Curvelet and CEDD for 
Content-Based Image Retrieval 

Heba A. Elnemr, Department of Computers and Systems, Electronics Research Institute 

Abstract- Content-Based Image Retrieval (CBIR) has been one of the most active research areas in the field of pattern 
recognition and computer vision over the past few years. The accessibility and progressive development of visual and 
multimedia data, as well as the evolution of the internet, emphasize the necessity of developing retrieval systems that are 
capable of dealing with large image collections. Many visual features have been explored, and it has generally been 
observed that a single kind of feature is not efficient in retrieving different types of images. Therefore, in 
this paper, the author proposes an efficient image retrieval technique that joins color and texture features. The curvelet 
descriptors obtained by using the wrapping-based discrete curvelet transform are used as texture features, while 
color features are extracted using a quantized RGB color histogram (QCH). In addition, the color edge directivity descriptor 
(CEDD), which joins color and texture features in one histogram, is obtained. A multiclass SVM is applied to classify 
the query images. Four datasets (ALOI, COIL-100, Wang, and Corel-1000) are used to test and assess the proposed 
system. Improved retrieval results are obtained over CBIR systems based on curvelet descriptors and CEDD 
individually and jointly. Furthermore, comprehensive experiments have been performed to select the number of 
histogram bins that achieves effective and efficient image retrieval. The average precision values obtained for the ALOI, 
COIL-100, Wang and Corel-1000 datasets are 0.996, 0.998, 0.898 and 0.964, respectively. Also, comparisons with several 
state-of-the-art methods demonstrate the effectiveness of the proposed system in improving retrieval performance. 

I. Introduction 

Content-based image retrieval (CBIR) is a technique that automatically searches for visually similar images 
in large-scale image databases according to users' requirements. Image retrieval systems based on visual 
image content have been the center of researchers' attention for more than a decade. The CBIR technique is 
carried out in two steps: feature extraction and matching. In the first step, which is considered the most 
challenging step in CBIR, effective features of each sample image are analyzed and extracted. Most existing 
general-purpose CBIR techniques use low-level features, such as color, texture, and shape. The set of 
extracted features is used to build the image signature. In the second step, the image signatures obtained 
from the images in the database are compared with the signature extracted from a query image using a predefined 
similarity measure, so that the most relevant images in the image database can be returned as the retrieved images 
[1]. 

Color content is one of the most extensively used visual features in CBIR 
systems. The color feature is comparatively the simplest and most straightforward visual feature for image retrieval. 
It is also capable of separating images from each other, and it is relatively robust to background complexity and 
invariant to image size and orientation [2, 3]. 

Texture is also one of the most commonly used low-level visual features in CBIR. Texture features 
provide spatial and relational information on the intensity distribution over the image [1, 3]. 

The shape of objects is frequently used as an effective feature for image retrieval because human visual 
perception can recognize a scene based only on object shapes. Shape features mainly contain semantic 
information about an object; thus, shape description and representation is a very difficult task [3, 4]. 

During the past years, various techniques have been developed for extracting effective and efficient features. 
Generally, many CBIR techniques are based on extracting a single type of feature [4]. However, 
acceptable retrieval results are hard to obtain using a single feature type because an image normally 
comprises various visual characteristics. Consequently, it is necessary to combine various feature types in such 
a way as to enhance the quality and efficiency of the extracted features and so obtain an acceptable 
retrieval performance [5]. 

This work aims mainly to construct an effective CBIR system that is capable of handling large datasets 
quickly. The proposed CBIR system integrates color and texture features through the quantized RGB color 
histogram (QCH), the color edge directivity descriptor (CEDD) and the curvelet transform. 

The RGB color model is the most commonly used color model in CBIR, and the color histogram is the simplest and 
most widespread color feature used for image retrieval. QCH denotes the procedure of reducing the number of 
histogram bins by gathering similar colors into the same bin. Thus, the QCH has a relatively low computational 
cost, and it is invariant to rotation, translation, and scale. However, QCH does not take into consideration the 
spatial distribution of color information, which is deemed to be the main weakness of this method. 
Also, different images could have similar histograms, and a minor variation in color due to variations in 
luminance could produce a considerable change in the histogram [6]. Therefore, the author combines QCH with 
CEDD and curvelet transform descriptors so as to attain good performance. 
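
The binning idea behind QCH can be illustrated with a short Python sketch. It is a simplified illustration and not the author's implementation: the function name `quantized_color_histogram`, the uniform quantization of each channel and the default of four bins per channel are assumptions made here, and pixels are taken to be (R, G, B) tuples with 8-bit values.

```python
def quantized_color_histogram(pixels, bins_per_channel=4):
    """Normalized RGB histogram with `bins_per_channel` ** 3 bins.

    Each 8-bit channel value is mapped uniformly to one of
    `bins_per_channel` levels, so similar colors share a bin.
    """
    if not 1 <= bins_per_channel <= 256:
        raise ValueError("bins_per_channel must be between 1 and 256")
    histogram = [0] * bins_per_channel ** 3
    for red, green, blue in pixels:
        # Quantize each channel from 0..255 to 0..bins_per_channel - 1.
        r = red * bins_per_channel // 256
        g = green * bins_per_channel // 256
        b = blue * bins_per_channel // 256
        histogram[(r * bins_per_channel + g) * bins_per_channel + b] += 1
    total = len(pixels)
    if total == 0:
        return histogram
    # Normalizing by the pixel count makes the result independent of size.
    return [count / total for count in histogram]


if __name__ == "__main__":
    image = [(255, 0, 0), (250, 5, 3), (0, 0, 255), (0, 0, 0)]
    print(quantized_color_histogram(image, bins_per_channel=2))
```

The histogram counts colors without regard to pixel position, which is why it is unaffected by rotation and translation and also why it carries no spatial information.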

CEDD is a powerful low-level feature that joins color and texture information in a single histogram. It also 
has a comparatively low computational cost; therefore, it is appropriate for use with huge image databases [7]. 

On the other hand, the curvelet transform has been widely adopted for image denoising, character recognition, image 
segmentation, texture analysis and image retrieval, and it has shown encouraging performance [3, 8-12]. 
Because curvelets capture more accurate texture and directional information 
and outperform wavelets and Gabor filters [3, 12], the author has used curvelet descriptors in the 
proposed retrieval system. A multiclass SVM model is used to accomplish the classification task. 

Finally, the author compared the proposed system with others based on CEDD alone, curvelet alone, and CEDD 
integrated with curvelet. Furthermore, several investigations have been made to choose the appropriate number 
of histogram bins to attain efficient and effective retrieval performance. These features prove to be 
complementary to each other, with promising results. 

The primary contributions of this work can be summarized as follows: (i) The suggested method extracts color 
and texture information using CEDD and combines it with the texture and directional information extracted 
using curvelet transform descriptors. (ii) The performance of the proposed CBIR system is improved using QCH, and 
various experiments are conducted to determine the number of bins that achieves the best performance. (iii) The 
performance of the proposed CBIR system is examined on several kinds of large databases, including natural, 
real-world and well-defined object images. 

The paper is organized as follows: Section 2 presents the related work. Section 3 describes the methodology 
used to build the proposed CBIR system, while Section 4 describes the experimental setup, including 
the datasets, and the experimental results. Conclusions and future work are discussed in Section 5. 

II. Related work 

Recently, researchers have proposed many approaches for CBIR using different features such as color, shape, 
and texture. In this section, the author discusses several recent works that cover some key aspects of CBIR 
techniques. 

The work of [12] applied the discrete curvelet transform to the Brodatz texture image dataset; low-order 
statistics are then computed from the transformed images. Euclidean distance is used for the similarity measurement in 
the suggested CBIR scheme. The results show that the proposed curvelet texture feature descriptor performs 
better than Gabor filters in both retrieval efficiency and accuracy. 

In [13], a retrieval system is proposed that uses the local feature descriptors SIFT and SURF to generate image signatures 
invariant to scale and rotation. Then, a BOVW model is created by clustering the local descriptors using the 
K-means technique to build the vocabulary of K clusters. Finally, the retrieval is accomplished using an SVM 
classifier model. 

Ali et al. [14] designed an image representation scheme based on the histograms of triangles. The proposed 
method aimed to add spatial information to the inverted index of Bag-of-Features representation. Histograms of 
triangles are carried out through two levels that are evaluated separately. In the first level, the image is divided 
into two triangles, while in the second level the image is divided into four triangles. Three different classifiers, 
Radial Basis Function Neural Networks (RBF-NN), SVM and Deep Belief Networks (DBN), are applied and the 
overall system performance is evaluated. 

The authors of [15] presented an experimental study that analyzes the effect of combining four sampling 
strategies (SIFT, SURF, Random patch generator and Gauss Random patch generator) with four global feature 
descriptors, namely the MPEG-7 descriptors Color Layout Descriptor (CLD), Edge Histogram Descriptor (EHD) and 
Scalable Color Descriptor (SCD) as well as the MPEG-7-like CEDD, in a Bag-Of-Visual-Word (BOVW) structure. 
The results demonstrate that the proposed descriptors outperform their original global forms. They also 
perform better than ordinary SIFT- and SURF-based approaches and perform comparably to or better than some 
recent methods. 

Malik and Baharudin [16] suggested a CBIR technique that is based on extracting quantized histogram statistical 
texture features in the compressed domain. The grayscale image is divided into non-overlapping blocks. Next, 
each block is transformed into a DCT block in the frequency domain. The similarity measurement is achieved 
using seven distance metrics. The experimental results demonstrate that the Euclidean distance has better 
performance in both computation and retrieval efficiency. 

In the work of [17], a CBIR system with texture and color features followed by an ant colony optimization 
feature selection technique is proposed. The wavelet transform of the sample images is computed and the 
low-frequency components are used as texture features. Dominant color descriptor, color statistic features, and 
color histogram features are extracted, in both RGB and HSV domains, as color features. For each feature type, a 
suitable similarity measure is presented. The ant colony optimization technique is implemented to select the most 
significant features among all extracted features. 

Walia and Pal [18] proposed an image retrieval framework based on combining low-level features. The Color 
Difference Histogram (CDH) is implemented to extract color and texture features, while Angular Radial 
Transform (ART) is used to extract shape features. The CDH algorithm is modified in order to improve the overall 
system performance. 

In the work of [19], a CBIR system is designed by combining SURF descriptors with color moments. Similarity 
matching is carried out with a KD-tree and the Best Bin First (BBF) search algorithm. A voting scheme 
algorithm is finally used to classify and retrieve the matched images from the dataset. 

Mukherjee et al. [20] proposed a CBIR system that relies on assigning a model of visual words to represent an 
image patch. Each image patch is represented by a vector that denotes the affinities of the patch for the most 
significant visual words. To improve the retrieval performance, a dissimilarity measure between pairs of image 
patches is introduced. The dissimilarity measure is made up of two terms: the first depicts the variation in 
affinities of the patches that belong to a common set of significant visual words, while the second penalizes 
the measure according to the number of visual words that affect only one of the two patches. 

III. Methodology 

This paper presents a new CBIR system that relies on extracting texture features as well as color features. 
The system uses the curvelet transform to obtain the spectral domain coefficients that are used to compute the 
texture descriptor of an image, while color features are extracted using QCH. Furthermore, CEDD is employed 
to obtain both color and texture information in a single histogram. Fig. 1 displays the block diagram of the 
proposed CBIR system. 

A. Quantized RGB Color Histogram 

The color histogram is a good method for describing the color content of an image; it is obtained by counting 
the number of occurrences of each color in the image. Pixels in an image are described by three components 
(typically but not necessarily) in a certain color space; consequently, each pixel is represented as a tuple of 
three numbers. The RGB color space is the most popular color space used in computer graphics. An RGB color 
histogram examines each of the RGB channels separately, which leads to a long histogram vector (256 * 3 = 768 
entries for an 8-bit RGB image). Thus, color quantization has to be applied in order to produce a 3D-color 
histogram that is suitable for building efficient indexes for large image databases and has an acceptable 
computational cost. In the color quantization procedure, the number of colors used to represent an image is 
reduced, and each color component is quantized into a number of ‘bins’. Since the color components (R, G, and B) 
are equally important, each component is quantized into the same number of bins. In this work, the QCH is 
implemented and various numbers of bins are tested in order to find the best quantization level. 
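
The quantization step can be illustrated with a minimal Python sketch (the paper's own implementation was in 
Matlab and is not shown). It assumes an 8-bit H x W x 3 image array, uniform bin edges, one histogram per 
channel concatenated into a vector of 3 x (bins per channel) entries, and normalization to unit sum; the 
function name and the normalization are illustrative assumptions, not taken from the paper. 

```python
import numpy as np


def quantized_rgb_histogram(image, bins_per_channel):
    """Concatenate one uniformly quantized histogram per RGB channel.

    Assumption: `image` is an H x W x 3 array with 8-bit values (0..255).
    The result has 3 * bins_per_channel entries and sums to 1.
    """
    image = np.asarray(image)
    parts = []
    for channel in range(3):
        # Uniform bins over the full 8-bit range [0, 256).
        hist, _ = np.histogram(image[..., channel],
                               bins=bins_per_channel, range=(0, 256))
        parts.append(hist)
    vector = np.concatenate(parts).astype(np.float64)
    total = vector.sum()
    # Normalization is an illustrative choice so images of any size compare.
    return vector / total if total > 0 else vector


# Usage: a pure-red 4 x 4 image quantized into 9 bins per channel.
red = np.zeros((4, 4, 3), dtype=np.uint8)
red[..., 0] = 255
print(quantized_rgb_histogram(red, 9).shape)
```

With 9 bins per channel the vector has 27 entries, which matches the histogram length implied by the feature 
vector sizes reported later in the paper. 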

B. CEDD histogram 

CEDD is a low-level feature that incorporates color and texture information in a single histogram. One of the 
most significant characteristics of CEDD is the low computational power required for its extraction in comparison 
with that needed for most MPEG-7 descriptors [7, 21]. CEDD combines a texture unit, which extracts textural 
information, with a color unit, which extracts color information. 

Figure 1. A block diagram of the proposed CBIR system. 

In order to obtain the CEDD histogram, each image is divided into 1600 image blocks, and each block is then 
processed as follows. In the texture unit, the image block is divided into 4 regions (sub-blocks), and the mean 
luminosity of the pixels enclosed within each sub-block is taken as the value of that sub-block. The luminosity 
values are computed within the YIQ color space. Afterward, each block is filtered using the 5 digital filters 
that were suggested in the MPEG-7 Edge Histogram Descriptor [7, 21]. 
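
As a rough illustration of the texture unit, the following Python sketch computes the four sub-block luminosity 
means of one image block and applies five 2 x 2 edge filters. The filter coefficients are those commonly given 
for the MPEG-7 Edge Histogram Descriptor and the luminosity weights are the standard Y component of YIQ; 
neither is quoted from this paper, and the thresholding that CEDD applies afterwards is not shown. All names 
are illustrative. 

```python
import numpy as np

SQRT2 = np.sqrt(2.0)

# Coefficients in sub-block order: top-left, top-right, bottom-left, bottom-right.
# Assumption: values as commonly listed for the MPEG-7 Edge Histogram Descriptor.
EDGE_FILTERS = {
    "vertical": np.array([1.0, -1.0, 1.0, -1.0]),
    "horizontal": np.array([1.0, 1.0, -1.0, -1.0]),
    "diagonal_45": np.array([SQRT2, 0.0, 0.0, -SQRT2]),
    "diagonal_135": np.array([0.0, SQRT2, -SQRT2, 0.0]),
    "non_directional": np.array([2.0, -2.0, -2.0, 2.0]),
}


def sub_block_luminosity(block_rgb):
    """Mean Y (YIQ luminosity) of the four sub-blocks of an RGB block."""
    block = np.asarray(block_rgb, dtype=np.float64)
    y = 0.299 * block[..., 0] + 0.587 * block[..., 1] + 0.114 * block[..., 2]
    half_h, half_w = y.shape[0] // 2, y.shape[1] // 2
    return np.array([
        y[:half_h, :half_w].mean(), y[:half_h, half_w:].mean(),
        y[half_h:, :half_w].mean(), y[half_h:, half_w:].mean(),
    ])


def edge_responses(block_rgb):
    """Absolute response of each edge filter for one image block."""
    means = sub_block_luminosity(block_rgb)
    return {name: abs(float(coeffs @ means))
            for name, coeffs in EDGE_FILTERS.items()}


# Usage: a block whose left half is white and right half is black.
block = np.zeros((8, 8, 3))
block[:, :4, :] = 255.0
responses = edge_responses(block)
print(max(responses, key=responses.get))
```

A block with a bright left half and a dark right half responds most strongly to the vertical filter, which is 
the behavior the texture unit relies on to classify the block's edge type. 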

On the other hand, in the color unit, a set of fuzzy rules is implemented to obtain the color information. Each 
image block is converted into the HSV color space. Next, the Fuzzy-Linking histogram technique [22] is applied. 
The fuzzy system creates a 24-bin histogram. 

The CEDD structure incorporates six texture regions, each of which holds 24 color regions. Accordingly, 
the final histogram consists of 6 x 24 = 144 bins, corresponding to the total of 144 color regions within the six 
texture regions. Finally, the histogram is normalized and quantized so that each bin is represented by 3 bits. 
Hence, its total size is limited to 144 x 3 = 432 bits or 54 bytes per image. This small size of the descriptor is 
considered to be another advantage of CEDD. 
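
The size arithmetic above can be checked with a short Python sketch that quantizes a 144-bin histogram to 3 bits 
per bin and packs it into bytes. The uniform quantization used here is an assumption made only for illustration 
(the text does not give the quantization levels CEDD actually uses), and the function names are hypothetical. 

```python
import numpy as np

N_TEXTURE_REGIONS = 6
N_COLOR_REGIONS = 24
BITS_PER_BIN = 3


def quantize_uniform(hist, bits=BITS_PER_BIN):
    """Map a non-negative histogram to integer levels 0 .. 2**bits - 1.

    Assumption: plain uniform quantization after scaling by the peak value.
    """
    hist = np.asarray(hist, dtype=np.float64)
    peak = hist.max()
    scaled = hist / peak if peak > 0 else hist
    levels = 2 ** bits
    return np.minimum((scaled * levels).astype(np.uint8), levels - 1)


def pack_levels(levels, bits=BITS_PER_BIN):
    """Pack integer levels into bytes, `bits` bits per value, MSB first."""
    levels = np.asarray(levels, dtype=np.uint8)
    shifts = np.arange(bits - 1, -1, -1, dtype=np.uint8)
    bit_matrix = (levels[:, None] >> shifts) & 1
    return np.packbits(bit_matrix.astype(np.uint8).ravel()).tobytes()


# Usage: a synthetic 6 x 24 = 144-bin histogram.
n_bins = N_TEXTURE_REGIONS * N_COLOR_REGIONS
descriptor = pack_levels(quantize_uniform(np.linspace(0.0, 1.0, n_bins)))
print(len(descriptor))
```

The packed descriptor occupies 144 x 3 / 8 bytes, in agreement with the 54 bytes stated in the text. 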

C. Curvelet Transform 

The curvelet transform is a multi-scale transform designed to represent images at various scales and angles; 
it was established by Donoho and Duncan in 1999 [23]. In this transform, the input image is initially 
decomposed into a set of subbands, and each subband is partitioned into blocks that are analyzed by the ridgelet 
transform. The ridgelet transform is realized using the Radon transform and the 1-D wavelet transform [24]. To 
avoid blocking effects, the spatial partitioning process uses overlapping windows, which leads to a large amount 
of redundancy. It is also very time-consuming, which makes it inappropriate for texture feature analysis in a 
large database. Therefore, Candes et al. [25] developed the fast discrete curvelet transform, which is based on 
the wrapping of Fourier samples. This transform is simpler, faster, less redundant, and has lower computational 
complexity since it applies the fast Fourier transform instead of the complex ridgelet transform. 

For a 2-D input image of size M x N, the curvelet transform based on the wrapping of Fourier samples generates 
a set of curvelet coefficients indexed by a scale j, an orientation l, and two spatial location parameters (k1, k2). 
These coefficients are defined as follows [25]: 

$$C_{j,l}(k_1, k_2) = \sum_{\substack{0 \le m < M \\ 0 \le n < N}} f(m, n)\, \varphi^{D}_{j,l,k_1,k_2}(m, n) \tag{1}$$

where f(m, n) is the Cartesian array of the input image and $\varphi^{D}_{j,l,k_1,k_2}(m, n)$ is a digital curvelet 
waveform. These coefficients are then used to form the curvelet texture descriptors by implementing statistical 
operations. 

Curvelet Texture Features Extraction 

After the curvelet coefficients in each sub-band are created and stored, curvelet statistical features, i.e. the 
mean and standard deviation of the coefficients in each sub-band, are computed. These features have proved 
capable of describing curvelet sub-bands [12, 26]. The mean and standard deviation are then used as texture 
descriptors of the image. Hence, for each curvelet sub-band, the author obtains two texture features. 
If n curvelet sub-bands are used for the transform, a 2n-dimensional texture feature vector is obtained for each 
image. 

Table 1 
CURVELET TRANSFORM (4 LEVEL DECOMPOSITION) 

Scale                                           1     2     3     4 
No. of sub-bands                                1    16    32     1 
Sub-bands considered for feature calculation    1     8    16     1 


In this work, a 4-level curvelet transform is implemented to decompose the input images. Based on this analysis, 
50 (=1+16+32+1) sub-bands of curvelet coefficients are computed. However, [12] shows that a curvelet at an 
angle θ generates the same coefficients as a curvelet at an angle (θ + π). Thus, due to this symmetry, half of 
the sub-bands at scales 2 and 3 may be discarded. Accordingly, 26 (=1+8+16+1) sub-bands are retained, producing 
a 52-dimensional feature vector for each image in the database. Table 1 illustrates the sub-band distribution at 
each transform level. 
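
The sub-band bookkeeping can be written out as a small Python sketch. The per-scale counts are those of 
Table 1; the rule that single-band scales are kept whole while oriented scales are halved is how the symmetry 
argument above is interpreted here, and the names are illustrative. 

```python
# Sub-bands per scale for the 4-level decomposition (values from Table 1).
SUBBANDS_PER_SCALE = {1: 1, 2: 16, 3: 32, 4: 1}


def retained_subbands(per_scale):
    """Keep single-band scales whole; keep half of the oriented sub-bands.

    Assumption: a curvelet at angle theta and one at theta + pi give the
    same coefficients, so one of each symmetric pair is sufficient.
    """
    return {scale: (count if count == 1 else count // 2)
            for scale, count in per_scale.items()}


retained = retained_subbands(SUBBANDS_PER_SCALE)
total_subbands = sum(SUBBANDS_PER_SCALE.values())
kept_subbands = sum(retained.values())
# Two statistics (mean and standard deviation) per retained sub-band.
feature_dimension = 2 * kept_subbands
print(total_subbands, kept_subbands, feature_dimension)
```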

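The mean of a sub-band at scale s and orientation θ is stated as: 

$$\mu_{s,\theta} = \frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} C_{s,\theta}(i, j) \tag{2}$$

while the standard deviation of a sub-band at scale s and orientation θ is expressed as: 

$$\sigma_{s,\theta} = \sqrt{\frac{1}{M \times N} \sum_{i=1}^{M} \sum_{j=1}^{N} \left( C_{s,\theta}(i, j) - \mu_{s,\theta} \right)^{2}} \tag{3}$$

where $C_{s,\theta}(i, j)$ represents the curvelet coefficient at scale s, orientation θ and location (i, j). 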
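
A minimal Python sketch of the feature extraction step is given below. It assumes that the retained sub-bands 
are already available as real-valued 2-D arrays produced by some curvelet implementation (the transform itself is 
not shown), and it uses the population form of the standard deviation; the function name is illustrative. 

```python
import numpy as np


def curvelet_texture_features(subbands):
    """Return [mean_1, std_1, ..., mean_n, std_n] for n coefficient arrays.

    Assumption: each sub-band is a real-valued array of curvelet
    coefficients computed elsewhere.
    """
    features = []
    for coefficients in subbands:
        c = np.asarray(coefficients, dtype=np.float64)
        mean = c.mean()
        # Population standard deviation (divides by the number of coefficients).
        std = np.sqrt(np.mean((c - mean) ** 2))
        features.extend([mean, std])
    return np.array(features)


# Usage: 26 synthetic sub-bands stand in for real curvelet output.
rng = np.random.default_rng(0)
fake_subbands = [rng.normal(size=(16, 16)) for _ in range(26)]
print(curvelet_texture_features(fake_subbands).shape)
```

With 26 retained sub-bands the function yields a 52-element vector, matching the dimension stated above. 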


IV. Results and Analysis 

A. Image Datasets 

The proposed system was assessed using four different standard datasets: the Amsterdam Library of Object Images 
(ALOI) dataset [27], the Columbia Object Image Library (COIL-100) dataset [28] and two subsets of the Corel 
image database [29]. 

The ALOI image dataset provides one thousand small objects recorded under various imaging conditions (viewing 
angle, illumination angle, and illumination color). More than a hundred images of each object were recorded. In 
this work, 102 objects are randomly selected with 72 different viewpoints each. COIL-100 is a well-known 
standard color image database that includes 100 objects with 72 views acquired by rotating each object around 
the vertical axis. Two subsets of the Corel database are utilized, each of which consists of 10 unrelated classes, 
and each class contains 100 images from the Corel stock photo database. The first subset is the Wang database, 
which contains the Africa, Beach, Building, Bus, Dinosaur, Elephant, Flower, Horse, Mountain and Food classes. 
The second is Corel-1000, which comprises the Dinosaur, Cyber, Horse, Bonsai, Texture, Fitness, Dishes, Antiques, 
Elephant and Easter egg groups. Fig. 2, Fig. 3, Fig. 4 and Fig. 5 illustrate samples of the utilized datasets. 

These databases are selected to represent two different types of CBIR tasks: the ALOI and COIL-100 datasets 
represent retrieval tasks that involve clearly defined objects seen from various viewing angles, while the 
Wang and Corel-1000 databases represent retrieval tasks with arbitrary real-world photographs. 

B. Implementation Details 

All assessments were performed on a Lenovo laptop with an Intel Core i7 2.20 GHz processor, 8 GB RAM, 
and Windows 10 Home Ultimate 64-bit as the operating system. The system was implemented in Matlab R2013b. 

In this work, two validation techniques are utilized, viz. repeated holdout validation and k-fold cross-validation. 
The repeated holdout method randomly splits the data into two disjoint subsets, a training set and a test set, and 
repeats this process with different subsamples. The k-fold validation, on the other hand, generates k subsets of 
equal size. The system is trained with k - 1 subsets and the remaining one forms the test set. This procedure is 
repeated k times. The holdout method is simpler and needs less computation; however, its test sets may overlap 
across repetitions. In contrast, k-fold cross-validation has the advantage that its test sets do not overlap, and 
all samples in the dataset are ultimately used for both training and testing. However, it is computationally 
expensive. 

Holdout validation is performed by randomly choosing 90% of the images of a dataset for training and reserving 
the remaining 10% for testing. This validation procedure is repeated five times and the average 
performance is computed. In addition, ten-fold cross-validation is employed to evaluate the system performance. 
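
The two splitting schemes can be sketched in Python as index generators. The 90/10 split, five repetitions and 
ten folds follow the text; the random seed, the function names and the use of NumPy are assumptions for 
illustration only (the paper's experiments were run in Matlab). 

```python
import numpy as np


def repeated_holdout(n_samples, train_fraction=0.9, repeats=5, seed=0):
    """Yield (train_idx, test_idx) for each random holdout repetition."""
    rng = np.random.default_rng(seed)
    n_train = int(round(train_fraction * n_samples))
    for _ in range(repeats):
        order = rng.permutation(n_samples)
        yield order[:n_train], order[n_train:]


def k_fold(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) so that the k test sets never overlap."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]


# Usage: a 1000-image dataset such as one of the Corel subsets.
for train_idx, test_idx in k_fold(1000, k=10):
    pass  # train on train_idx, evaluate on test_idx, then average the scores
print(len(train_idx), len(test_idx))
```

In the k-fold case every image appears in exactly one test set, whereas repeated holdout may test the same 
image more than once, which is the trade-off described above. 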

To evaluate the proposed retrieval system, a 4-level curvelet transform is applied. In addition, different 
quantization levels are used to compute the color histogram (CH) so as to find the optimum number of bins. The 
tests are carried out on 5 quantization levels: 9, 16, 25, 64 and 100 bins for each of the RGB channels. 



Figure 2. Sample images from ALOI database. 

Figure 3. Sample images from COIL-100 database. 

Figure 4. Sample images from Corel-1000 database. 

Figure 5. Sample images from Wang database. 


Recall and precision metrics are utilized to measure the performance of the proposed CBIR system. The two 
metrics are defined as follows: 

$$\text{Precision} = \frac{\text{Number of relevant images retrieved}}{\text{Total number of images retrieved}} \tag{4}$$

$$\text{Recall} = \frac{\text{Number of relevant images retrieved}}{\text{Total number of relevant images in the database}} \tag{5}$$

Furthermore, the Precision-Recall curve (PRC) is used to assess the effectiveness of the proposed image retrieval 
system. 
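
Equations (4) and (5) translate directly into code. The following Python sketch assumes that images are 
identified by hashable ids and that the set of relevant images for a query is known; the function name and the 
example ids are illustrative. 

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: ids of the images returned by the system.
    relevant:  ids of all images in the database relevant to the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant images that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Usage: 4 images retrieved, 2 of them among the 5 relevant ones.
print(precision_recall([1, 2, 3, 4], [2, 4, 6, 8, 10]))
```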

C. Experimental results and discussion 

In this section, the author presents and discusses the results of the experimental evaluation of the proposed 
retrieval system. To estimate the efficiency of all implemented features (curvelet descriptors, quantized RGB CH 
(QCH) and CEDD) on the given databases, the author extracted these features from the images and ran 
experiments to test the effect of the curvelet descriptors and CEDD features individually, the combination of 
both, and the combination of both with the QCH. In addition, the author examined the effect of using different 
quantization levels (five quantization levels are studied: 9, 16, 25, 64 and 100). For all experiments, the author 
reports the average precision and recall ratios for the holdout and K-fold validation methods. Moreover, the 
results are benchmarked against previous works that use the same databases. The author also compares the 
results of her previous work [30] with those of this work. 

In this work, each image is represented by a feature vector of size 

$$\text{feature vector size} = N_{curvelet} + N_{CEDD} + N_{CH}, \qquad N_{CH} = i \times 3 \tag{6}$$

where $N_{curvelet}$ is the dimension of the curvelet descriptor (52 for 4-level decomposition), $N_{CEDD}$ is the 
dimension of the CEDD vector (144), $N_{CH}$ is the total number of bins of the 3D-color histogram and i is the 
number of bins for each RGB channel (since the RGB channels are equally quantized). Table 2 depicts the 
dimensions of the extracted feature vectors for the different realized methods. 
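
As a quick check of Equation (6), the following Python sketch computes the combined vector length for each 
tested quantization level, using the curvelet and CEDD dimensions stated above. The function name is 
illustrative. 

```python
N_CURVELET = 52   # 4-level decomposition, 26 sub-bands x 2 statistics
N_CEDD = 144      # 6 texture regions x 24 color regions


def feature_vector_size(bins_per_channel=None):
    """Length of the CEDD + curvelet vector, optionally with the QCH part."""
    size = N_CURVELET + N_CEDD
    if bins_per_channel is not None:
        size += 3 * bins_per_channel  # N_CH = i x 3
    return size


sizes = {i: feature_vector_size(i) for i in (9, 16, 25, 64, 100)}
print(feature_vector_size(), sizes)
```

The values produced this way (196 without the QCH, then 223, 244, 271, 388 and 496) agree with the sizes 
listed in Table 2. 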

Fig. 6 and Fig. 7 present the experiments conducted on the ALOI dataset using different retrieval methods. The 
results indicate that joining CEDD and curvelet improves the retrieval system significantly and that combining 
them with color histogram descriptors further enhances the retrieval performance. The optimal precision using 
K-fold cross-validation (0.995) and using holdout validation (0.996) is attained when the quantization level is 16 
or 100 for each RGB channel, whereas it is near optimal at quantization levels 9, 25 and 64, with tiny variations. 
The best recall (0.994) is obtained with 25 quantization levels using holdout validation as well as with 100 
quantization levels using K-fold validation. The proposed system achieved good performance under both K-fold 
and holdout validation, which demonstrates the efficiency of the proposed retrieval system. 

Fig. 8 and Fig. 9 summarize the experimental results of all implemented descriptors on the COIL-100 dataset. 
The results reported in Fig. 8 indicate that the best precision ratio (0.998) is reached by integrating CEDD, 
curvelet and RGB color histogram with 25 bins using holdout validation and with 9 and 25 bins using K-fold 
validation. It is also close to optimal (0.997), with an insignificant difference, when using 16 bins for both 
methods of validation. Additionally, Fig. 9 illustrates that the best recall ratio (0.998) is attained by joining 
CEDD, curvelet and RGB color histogram with 9 and 25 bins employing K-fold validation, and it is near optimal 
(0.997) using 16, 25 and 64 bins with holdout validation as well as 64 bins with K-fold validation. It is also 
worth noting that recall is close to optimal (0.997) when joining the CEDD and curvelet descriptors under 
holdout validation. 

Fig. 10 and Fig. 11 report the results obtained by employing the proposed retrieval techniques on the 
Corel-1000 dataset. These figures show that integrating CEDD and curvelet descriptors enhances the 
precision ratio for both validation methods; the precision reaches its maximum value (0.954) using K-fold 
validation. Moreover, merging these descriptors with the RGB color histogram with 25 bins further improves the 
precision value (0.964), and it is close to optimal (0.955) with 16 bins using holdout validation as well as (0.951) 
with 9 bins using K-fold validation. Also, utilizing 16 and 25 bins yields almost the best recall ratio (0.948 and 
0.943 using K-fold and holdout validation, respectively). 

Fig. 12 and Fig. 13 present the comparison results on the Wang dataset using the proposed retrieval methods. 
From the results, the author observed that the optimal precision ratio is reached utilizing CEDD, curvelet and 
RGB color histogram with 16 bins (0.898) using holdout validation and 9 bins (0.892) using K-fold validation, 
with a tiny difference (0.006). Furthermore, the recall ratio achieves the best results when employing CEDD, 
curvelet and RGB color histogram with 16 bins (0.874 using holdout validation) and 9 bins (0.877 using K-fold 
validation). 

It can be noticed from the results that CBIR using CEDD performs better than CBIR using curvelet; this 
is because CEDD has both color and texture features. Moreover, joining both descriptors enhances the performance 
significantly. Besides, integrating the QCH descriptor improves the retrieval efficiency further, and this 
improvement varies as the quantization levels vary. The author can also observe that the use of more 
quantization levels does not necessarily lead to better precision. On the contrary, it leads to a much less 
efficient search. In CBIR, retrieval efficiency is more essential than a slight precision gain. A small retrieval 
performance enhancement at the cost of a much higher dimension will reduce the efficiency of the entire system. 
Thus, for retrieval efficiency, the author recommends CEDD, curvelet and RGB color histogram obtained from 
either 9, 16 or 25 quantization levels as appropriate for accomplishing the retrieval tasks on all stated datasets. 




Table 2 
The feature vector size for the implemented retrieval methods 

Method                                   Feature vector size 
CEDD                                     144 
Curvelet                                 52 
CEDD + Curvelet                          196 
CEDD + Curvelet + QCH with 9 bins        223 
CEDD + Curvelet + QCH with 16 bins       244 
CEDD + Curvelet + QCH with 25 bins       271 
CEDD + Curvelet + QCH with 64 bins       388 
CEDD + Curvelet + QCH with 100 bins      496 


Figure 6. Average precision for different methods on ALOI dataset 
[Bar chart: precision per method; series: holdout validation, k-fold cross-validation.] 

Figure 7. Average recall for different methods on ALOI dataset 
[Bar chart: recall per method; series: holdout validation, k-fold cross-validation.] 

Figure 8. Average precision for different methods on COIL-100 dataset 
[Bar chart: precision per method; series: holdout validation, k-fold cross-validation.] 

Figure 9. Average recall for different methods on COIL-100 dataset 
[Bar chart: recall per method; series: holdout validation, k-fold cross-validation.] 

Figure 10. Average precision for different methods on Corel-1000 dataset 
[Bar chart: precision per method; series: holdout validation, k-fold cross-validation.] 

Figure 11. Average recall for different methods on Corel-1000 dataset 
[Bar chart: recall per method; series: holdout validation, k-fold cross-validation.] 

Figure 12. Average precision for different methods on Wang dataset 
[Bar chart: precision per method; series: holdout validation, k-fold cross-validation.] 

Figure 13. Average recall for different methods on Wang dataset 
[Bar chart: recall per method; series: holdout validation, k-fold cross-validation.] 


Moreover, the results show that different features perform differently on the various datasets. Fig. 14 and 
Fig. 15 present the average precision of applying CEDD, curvelet and RGB color histogram descriptors with 9, 16 
and 25 bins over all databases using K-fold and holdout validation, respectively. It is obvious that the system 
performs very well on the ALOI and COIL-100 datasets compared to the Wang and Corel-1000 datasets; this is 
mainly because of the consistent background in all images of the ALOI and COIL-100 datasets. 

Furthermore, the author compared her retrieval method with other existing techniques. Table 3, Table 4, Table 5 
and Table 6 report comparisons between the proposed method and a group of other techniques on the explored 
datasets. The performance of the proposed system, evaluated in terms of precision, achieved good results, and 
the proposed model is competitive with all the compared models. 

The results obtained on the ALOI and Corel-1000 datasets indicate that the proposed system significantly 
outperforms previously addressed methods. However, for the COIL-100 dataset, the method suggested in [35] 
shows comparable retrieval precision. Nevertheless, this method examines only 10 objects that were selected 
randomly from the whole dataset. 



Figure 14. Average precision using K-fold validation for different extracted features over each database 
[Bar chart. Series: ALOI, COIL-100, Corel-1000, Wang. Groups: CEDD+Curvelet+QCH (9 bins), 
CEDD+Curvelet+QCH (16 bins), CEDD+Curvelet+QCH (25 bins).] 

Figure 15. Average precision using holdout validation for different extracted features over each database 
[Bar chart. Series: ALOI, COIL-100, Corel-1000, Wang. Groups: CEDD+Curvelet+QCH (9 bins), 
CEDD+Curvelet+QCH (16 bins), CEDD+Curvelet+QCH (25 bins).] 

Table 3 
Comparison with other methods on COIL-100 dataset 

Method                                                                    Precision 
The proposed method (with 25 bins)                                        0.998 
Kavya and Shashirekha [31] (only 10 random objects were considered)       0.86 
Velmurugan and Baboo [19] (only 15 random objects were considered)        0.88 
Mukherjee et al. [20] (only 10 random objects were considered)            0.86 
Bahri and Zouaki [32] (only 15 random objects were considered)            0.78 
Elnemr [30]                                                               0.93 
Khatabi et al. [35] (only 10 random objects were considered)              0.9985 


Table 4 
Comparison with other methods on ALOI dataset 

Method                                                                            Precision 
The proposed method (with 25 bins)                                                0.995 
Alkhawlani, Elmogy and Elbakry [13] (only 10 random objects were considered)      0.88 


Table 5 
Comparison with other methods on Wang dataset 

Method                                     Precision 
The proposed method (with 16 bins)         0.898 
Ali et al. [14]                            0.877 
Iakovidou et al. [15]                      0.82 
Rashno, Sadri and SadeghianNejad [17]      0.63 
Mehmood et al. [33]                        0.84 
Vilvanathan and Rangaswamy [34]            0.75 


Table 6 
Comparison with other methods on Corel-1000 dataset 

Method                                  Precision 
The proposed method (with 4 bins)       0.95 
Elnemr [30]                             0.88 


V. Conclusion 

This paper proposes a new CBIR technique that is based on integrating CEDD, curvelet and QCH descriptors. 
The classification stage is performed using a multiclass SVM. Generally, the precision of a CBIR system decreases 
as the number and variety of images in the dataset increase. Thus, the author assessed her proposed retrieval 
technique on benchmark databases from various domains so as to cover a wide range of different CBIR 
applications. The performance is evaluated using precision and recall as metrics. K-fold cross-validation and 
holdout validation are used to validate the results of the various investigated descriptors as well as the 
different quantization levels. The experimental results are analyzed by comparing the retrieval performance of 
CEDD and curvelet individually and jointly, as well as that of integrating CEDD, curvelet, and QCH with 
different numbers of bins. The results indicate that merging the CEDD and curvelet descriptors enhances the 
retrieval performance significantly. Furthermore, integrating them with QCH improves the performance further. 
The author also concludes that combining CEDD, curvelet and RGB color histogram descriptors with 9, 16 or 25 
bins outperformed the other examined descriptors regarding efficiency and scalability. Additionally, 
comparisons with existing CBIR techniques illustrate the effectiveness and efficiency of the proposed method. 
Thus, the author demonstrates the prospect of creating a better CBIR system with more significant feature 
sets. 

In the future, the employed datasets can be enlarged and new classes can be added to design a generalized and 
efficient retrieval system. Furthermore, advanced techniques like deep learning may be used to develop an 
efficient system for image retrieval and annotation. 

Abstract: Cloud computing is one of the important subjects nowadays, in which services are 
provided to users by a cloud provider, and users pay the provider according to their use of 
the services. Resource allocation and task scheduling are important for managing tasks in a 
cloud environment for load balancing. Task scheduling is an important step in improving the 
overall performance of cloud computing. Task scheduling is also essential to reduce power 
consumption and improve the profit of service providers by reducing processing time. So, for 
task scheduling, various “quality of service” (QoS) parameters are considered for reducing 
execution time and maximizing throughput. For this purpose, multi-objective optimization of 
task scheduling is used, which is a sub-domain of the “multi-criteria decision making” 
problem. This involves more than one objective function to be optimized at the same time. 

Keywords: Cloud computing, multi-objective, QoS parameter 

Introduction 

In the IT industry, cloud computing [5] is the latest buzzword. Built on the foundations of 
grid computing, utility computing, service-oriented architecture, virtualization and Web 2.0, 
it is an emerging computing paradigm. With just an Internet connection, the user can access 
all the required software, hardware, platforms, applications, infrastructure and storage. 
A cloud is a type of parallel and distributed system consisting of a collection of 
interconnected and virtualized computers that are dynamically provisioned and presented as 
one or more unified computing resources based on SLAs established through negotiation 
between the service providers and consumers. In the growing, information-technology-oriented 
market of businesses and organizations, cloud computing provides virtual resources that are 
dynamically scalable. It describes virtualized resources, software, platforms, applications, 
computation and storage that are scalable and provided to users instantly, with payment 
only for what they use [5]. 

Multi-objective Optimization 

Optimization [5] deals with the problem of seeking solutions over a set
of possible choices so as to optimize certain criteria. If there is only
one criterion to be taken into consideration, the problem is a
single-objective optimization problem; such problems have been
extensively studied for the past 50 years. If there is more than one
criterion that must be treated simultaneously, we have a multi-objective
optimization problem. Multiple-objective problems arise in the design,
modeling and planning of complex real systems in areas such as
industrial production, urban transportation, capital budgeting, forest
management, reservoir management, layout and landscaping of new cities,
energy distribution, etc. Almost every important real-world decision
problem involves multiple and conflicting objectives, which need to be
tackled while respecting various constraints, leading to overwhelming
problem complexity. Multiple-objective problems have been receiving
growing interest from researchers with various backgrounds since the
early 1960s. A number of scholars have made significant contributions to
the problem. Among them, Pareto is perhaps one of the most recognized
pioneers in the field.
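
To make the notion of conflicting objectives concrete, the following is a minimal sketch of Pareto dominance, assuming that every objective is to be minimized. The function names and the sample (execution time, cost) pairs are illustrative assumptions and are not taken from any of the surveyed papers.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if solution a Pareto-dominates b (all objectives minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]


# Hypothetical (execution time, cost) pairs for four candidate schedules.
candidates = [(10.0, 5.0), (8.0, 7.0), (12.0, 6.0), (9.0, 9.0)]
front = pareto_front(candidates)
```

Here the first two candidates remain on the front because neither is better in both objectives, while the last two are each beaten in both objectives by some other candidate.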

Scheduling and scheduling criteria 

Scheduling allocates tasks to appropriate machines to achieve some
objectives; in other words, it determines which task should be executed
on which machine. In traditional scheduling, tasks are directly mapped
to resources at one level, whereas nowadays resources in the cloud are
scheduled at two levels, i.e. the physical level and the VM level, as
depicted in Figure 1. There are mainly two types of task scheduling in
cloud computing: static scheduling and dynamic scheduling. In static
task scheduling, information about the task, such as its execution time,
is known before execution, whereas in dynamic task scheduling it is not
[11]. In a cloud environment, to execute a task a user requests a
computing resource, which is allocated by the cloud provider after
finding an appropriate resource among the existing ones, as shown in
Figure 1. Tasks submitted for execution by users may have different
requirements such as execution time, memory space, cost, data traffic,
response time, etc. Also, the resources involved in cloud computing may
be diverse and geographically dispersed. There are different
environments in the cloud: the single-cloud environment and the
multi-cloud environment.

The scheduling process [6] in the cloud can be categorized into three
stages, namely:

a. Resource discovering and filtering - A resource request is made by
the cloud user and submitted to the service provider; the service
provider searches for and locates suitable resources.

b. Resource selection - A resource is selected on the basis of task and
resource selection parameters.

c. Task submission - The task is submitted to the selected resource.

Figure 1: View of cloud


Scheduling criteria 

The criteria differ with respect to service provider 
and user. Service provider wants to gain revenue, 
maximize resource utilization with minimal 
efforts, whereas, user wants his job to be executed 
with minimal cost in minimum time [10]. Figure 2 
below shows the Conceptual modeling of the 
Cloud Computing environment for Task 
Scheduling. 


A. Cloud User Preferred 

1) Makespan: the finishing time of the last task. The makespan should be
minimal; a smaller makespan means the whole set of tasks completes
sooner.

2) Cost: the sum of the amounts paid by the user to the provider for
using individual resources.

3) Waiting time: the time spent by a task in the ready queue before it
gets a chance to execute.

4) Turnaround time (TAT): the time taken by a task to complete its
execution after its submission, i.e., the sum of the waiting time and
the execution time of the task.

5) Tardiness: the delay in execution of a task, i.e. the difference
between its finishing time and its deadline. For an optimal schedule the
tardiness should be zero, which means no delay in execution.

6) Fairness: all tasks get an equal opportunity of execution.

7) Response time: the time taken by the system to start responding
(first response) after submission of a task.
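
The time-based criteria above can be written down directly. The sketch below assumes non-preemptive execution (so all waiting happens before a task starts) and uses a hypothetical `Task` record with made-up timestamps; it is an illustration of the definitions, not code from the surveyed papers.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    arrival: float   # submission time
    start: float     # time the task first gets the resource
    finish: float    # completion time
    deadline: float  # time by which the task should complete


def makespan(tasks: List[Task]) -> float:
    """Finishing time of the last task."""
    return max(t.finish for t in tasks)


def waiting_time(t: Task) -> float:
    """Time spent in the ready queue (non-preemptive assumption)."""
    return t.start - t.arrival


def turnaround_time(t: Task) -> float:
    """Waiting time plus execution time, i.e. finish minus submission."""
    return t.finish - t.arrival


def tardiness(t: Task) -> float:
    """Delay past the deadline; zero when the task finishes on time."""
    return max(0.0, t.finish - t.deadline)


# Three hypothetical tasks executed back to back on one resource.
tasks = [Task(0, 0, 4, 5), Task(1, 4, 9, 8), Task(2, 9, 11, 12)]
```

Under the non-preemptive assumption the response time of a task coincides with its waiting time, so it is not listed separately.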

B. Cloud provider preferred 

1) Resource utilization: the resources should be fully utilized by
keeping them as busy as possible to gain the maximum profit.

2) Throughput: the number of tasks completed per unit time.

3) Predictability: the consistency in the response times of tasks.
Unpredictable response times may degrade the performance of the system.

4) Priority: giving preference to a task so that it finishes as early as
possible. Priority can be given on the basis of arrival time, execution
time, deadline, etc. Resources are provided to higher-priority tasks
first to complete their execution.

5) Load balancing: distribution of load among all the computing
resources.

6) Deadline: the time by which a task should be completed.

7) Energy efficiency: reducing the amount of energy used to provide any
solution or service.
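
As a rough illustration of the provider-side criteria, the sketch below computes throughput, per-VM utilization and a simple load-imbalance measure from a finished schedule. The schedule format, the VM names and the imbalance measure are assumptions made for this example; tasks on the same VM are assumed not to overlap.

```python
from typing import Dict, List, Tuple

# Each entry is (vm_id, start, finish) for one completed task (hypothetical values).
Schedule = List[Tuple[str, float, float]]


def throughput(schedule: Schedule) -> float:
    """Tasks completed per unit time over the whole schedule."""
    horizon = max(f for _, _, f in schedule) - min(s for _, s, _ in schedule)
    return len(schedule) / horizon


def utilization(schedule: Schedule) -> Dict[str, float]:
    """Fraction of the schedule horizon that each VM spends busy."""
    start = min(s for _, s, _ in schedule)
    end = max(f for _, _, f in schedule)
    busy: Dict[str, float] = {}
    for vm, s, f in schedule:
        busy[vm] = busy.get(vm, 0.0) + (f - s)
    return {vm: b / (end - start) for vm, b in busy.items()}


def load_imbalance(schedule: Schedule) -> float:
    """Gap between the busiest and the least busy VM (0 means balanced)."""
    u = utilization(schedule).values()
    return max(u) - min(u)


schedule = [("vm0", 0, 4), ("vm0", 4, 10), ("vm1", 0, 5)]
```

In this example `vm0` is busy for the whole horizon while `vm1` idles half of the time, which a load-balancing scheduler would try to even out.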
Literature Survey

Dynamic Multi-objective task scheduling in Cloud Computing based on
Modified particle swarm optimization (2015): A.I. Awad et al. [1]
address the efficient allocation of tasks to available virtual machines
at the user level based on different parameters such as reliability,
time, cost and load balancing of virtual machines. Agents are used to
create a dynamic system. The proposed mathematical model, multi-objective
Load Balancing Mutation particle swarm optimization (MLBMPSO), is used
to schedule and allocate tasks to resources, as shown below in Figure 3.
MLBMPSO considers two objective functions, minimizing round trip time
and total cost. Reliability is achieved by detecting tasks that failed
to be allocated and rescheduling them on available resources based on
the load of the virtual machines.

Figure 3: Proposed model by author

Figure 3 [1] depicts two behaviours.

The first behaviour is responsible for:

1. Task Buffer

2. Task and Resource Information

The second behaviour is responsible for:

1. MLBMPSO

2. Task Submission

Responsive Multi-objective Load Balancing Transformation Using Particle
Swarm Optimization in Cloud Environment (2016): VG. Ravindhren et al.
[9] state that resource allocation among multiple clients has to be
ensured as per SLAs. To accomplish these goals and achieve high
performance, it is important to design and develop a Responsive
Multi-Objective Load Balancing Transformation algorithm (RMOLBT) based
on abstraction in a multi-cloud environment. The model is represented
below in Figure 4. Scheduling the tasks while satisfying the user’s
Quality of Service (QoS) requirements is the most challenging part. The
paper proposes task scheduling and resource utilization using Particle
Swarm Optimization (PSO) in a cloud environment. The results demonstrate
the suitability of the proposed scheme, which increases throughput,
reduces waiting time, considerably reduces missed processes and balances
load among the physical machines in a data centre in a multi-cloud
environment.

Figure 4: System Model for Responsive Multi-objective Load Balance
Transformation using PSO


An Efficient Approach for Task Scheduling Based on Multi-Objective
Genetic Algorithm in Cloud Computing Environment (2014): Sourabh
Budhiraja et al. [5] state that the scheduling of cloud services to
consumers by service providers influences the cost benefit of this
computing paradigm. In such a scenario, tasks should be scheduled
efficiently so that the execution cost and time can be reduced. The
authors propose an efficient approach for task scheduling based on a
Multi-Objective Genetic Algorithm (MOGA), shown below in Figure 5, which
minimizes execution time as well as execution cost. For task scheduling,
a multi-objective genetic algorithm [5] is implemented, and the research
focuses on crossover operators, mutation operators, selection operators
and the Pareto solutions method.

Figure 5: Flow chart of the Modified Genetic Algorithm (MGA) [1]

Multi-Objective Tasks Scheduling Algorithm for Cloud Computing
Throughput Optimization (2015): Atul Vikas Lakra et al. [2] state the
problem as binding the set of tasks received by the broker to the
received list of VMs so that the execution time of the workload is
reduced to a minimum. Single-objective scheduling algorithms have some
problems. For example, in priority task scheduling, high-priority tasks
always get a chance to execute, so low-priority tasks have to wait for a
long time. Sometimes a low-priority task gets a chance to execute, but
if high-priority tasks keep arriving, the low-priority task is
pre-empted and the CPU is allocated to a high-priority task; this
increases the execution time of the task and reduces the throughput of
the system. Similarly, the First Come First Serve (FCFS) and Shortest
Job First (SJF) task scheduling algorithms face problems in the
worst-case scenario. These algorithms perform very well in the best
case, but in the worst case they degrade performance to a very low
level. So an efficient scheduling algorithm is required which can
provide optimized performance in both cases. Implementing a proper
scheduling algorithm in the broker improves the datacenter’s performance
without violating service level agreements. The order of task submission
and the VMs also influence the execution time of the entire workload.


Figure 6: Cloud computing architecture and the proposed work



The cloud computing architecture and the proposed work are shown above
in Figure 6, which depicts that the cloud broker is responsible for
mediating negotiations between the SaaS and the cloud provider; such
negotiations are driven by QoS requirements. The broker acts on behalf
of the SaaS for the allocation of resources that can meet the
application’s QoS requirements. Figure 7 below presents the
multi-objective task scheduling algorithm [2] proposed by the authors.

Algorithm 1: Multi-objective task scheduling algorithm

1. Submit both the list of successfully created VMs in the datacenter
   and the task set to the broker.
2. Create a received list of tasks.
3. Create a received list of VMs.
4. Non-dominated sorting (list of tasks)
       i ← 0
       Create empty non-dominated list
       dominated list ← list of tasks
       Initially put task_0 in the non-dominated list
       for all i ← 1 to size of task's list do
           for all j ← 0 to size of non-dominated list do
               if task_j dominates task_i then
                   put task_j into non-dominated set
               else if task_i dominates task_j then
                   put task_i into non-dominated set
               else
                   put task_i and task_j into non-dominated set
               end if
           end for
       end for
5. Sort the list of tasks according to the non-dominated task set.
6. Sort the received VM list in descending order.
7. j ← 0
       for all i ← 0 to the size of task's list do
           if j ≥ 0 then
               Bind task_i to VM_j; j++
               if j == number of VMs then
                   j ← 0
               end if
           end if
       end for

Figure 7: Algorithm for multi-objective task scheduling [2]
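
The flow of Figure 7 can be approximated in a few lines. The sketch below is one possible reading, not the authors' implementation: it assumes each task carries two objectives (size and a QoS value, lower being preferred for both), orders tasks front by front, sorts VMs by MIPS in descending order and binds tasks to VMs in round-robin fashion. All names and numbers are hypothetical.

```python
from typing import Dict, List, Tuple

Task = Tuple[str, float, float]  # (name, size in MI, QoS value); lower is preferred


def dominates(a: Task, b: Task) -> bool:
    """a dominates b if it is no worse in both objectives and better in one."""
    return (a[1] <= b[1] and a[2] <= b[2]) and (a[1] < b[1] or a[2] < b[2])


def non_dominated_sort(tasks: List[Task]) -> List[Task]:
    """Order tasks front by front; each front is non-dominated among the rest."""
    remaining = list(tasks)
    ordered: List[Task] = []
    while remaining:
        front = [t for t in remaining
                 if not any(dominates(o, t) for o in remaining if o is not t)]
        ordered.extend(front)
        remaining = [t for t in remaining if t not in front]
    return ordered


def bind(tasks: List[Task], vms: Dict[str, float]) -> List[Tuple[str, str]]:
    """Bind sorted tasks to VMs (descending MIPS), wrapping around the VM list."""
    vm_order = sorted(vms, key=vms.get, reverse=True)
    ordered = non_dominated_sort(tasks)
    return [(t[0], vm_order[i % len(vm_order)]) for i, t in enumerate(ordered)]


tasks = [("t1", 400, 2), ("t2", 100, 1), ("t3", 250, 3), ("t4", 100, 3)]
vms = {"vmA": 1000.0, "vmB": 2000.0}  # VM name -> MIPS
binding = bind(tasks, vms)
```

The modulo index plays the role of resetting j to 0 once every VM has received a task.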

Multi-Target Tasks Scheduling Algorithm for Cloud-environment Throughput
Optimization (2016): Shubhashree S. R et al. [7] proposed a multi-task
scheduling algorithm that enhances data center execution without
violating the SLA. The proposed algorithm, shown below in Figure 8 and
Figure 9, uses a non-dominated sorting algorithm to handle the multiple
objectives (task size, QoS value). After a fixed time interval, the list
is updated dynamically. The algorithm gives optimized throughput when
compared with the existing algorithm.

Figure 8: Multi-objective task scheduling and dominance relation

1. Submit both the list of successfully created VMs in the datacenter
   and the task set to the broker.
2. Create a received list of tasks.
3. Create a received list of VMs.
4. Non-dominated sorting (list of tasks)
       i ← 0
       Create empty non-dominated list
       dominated list ← list of tasks
       Initially put task_0 in the non-dominated list
       for all i ← 1 to size of task’s list do
           for all j ← 0 to size of non-dominated list do
               if task_j dominates task_i then
                   put task_j into non-dominated set
               else if task_i dominates task_j then
                   put task_i into non-dominated set
               else
                   put task_i and task_j into non-dominated set
               end if
           end for
       end for
5. Sort the list of tasks according to the non-dominated task set.
6. Sort the received VM list in descending order.
7. j ← 0
       for all i ← 0 to the size of task’s list do
           if j ≥ 0 then
               Bind task_i to VM_j; j++
               if j == number of VMs then
                   j ← 0
               end if
           end if
       end for

Figure 9: Algorithm for multi-target task scheduling [7]

Multi-Objective Task Scheduling in Cloud Computing Using an Imperialist
Competitive Algorithm (2016): Majid Habibi et al. [3] state that, since
task scheduling in cloud computing environments and distributed systems
is an NP-hard problem, in most cases meta-heuristic methods inspired by
nature are used to optimize scheduling rather than traditional or greedy
methods. One of the most powerful meta-heuristic optimization methods
for complex problems is the Imperialist Competitive Algorithm (ICA).
Thus, in this paper, a meta-heuristic method based on ICA is provided to
optimize scheduling in the cloud environment. Figure 10 below depicts
the pseudo code proposed by the authors.


Input: npop (population size), problem-size, ep, α, β, pr
For i = 1 to npop do
    C_i.position ← RandomPosition(problem-size)
    If i <= ep then
        EmpiresPopulation ← C_i.position
    Else
        C_w ← GetWorstSolution(EmpiresPopulation)
        If Cost(C_i.position) < Cost(C_w.position) then
            Replace(EmpiresPopulation, C_i, C_w)
        Else
            C_i.empire ← AssignAnEmpire(EmpiresPopulation)
        End
    End
    Population ← C_i
End
EvaluatePopulation(Population)
EvaluateEmpiresPopulation(EmpiresPopulation, Population)
ImperialisticCompetition(EmpiresPopulation, Population)
EliminateWeakestEmpire(EmpiresPopulation, Population)
End
EvaluatePopulation(Population)
BestSol ← GetBestSolution(Population)
Return BestSol

Figure 10: Pseudo code for the algorithm proposed by the author [3]
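
The initialization phase of the pseudo code can be sketched as follows. This is only an interpretation under stated assumptions: the cost function `sphere` is a stand-in (a real scheduler would score a task-to-VM mapping), colonies are assigned to a random empire, and an imperialist that gets replaced is kept as a colony of its own empire. None of these choices is confirmed by the source.

```python
import random
from typing import Callable, List, Tuple

Position = List[float]
Empire = Tuple[Position, List[Position]]  # (imperialist, colonies)


def sphere(x: Position) -> float:
    """Stand-in cost function (hypothetical), lower is better."""
    return sum(v * v for v in x)


def init_empires(npop: int, problem_size: int, ep: int,
                 cost: Callable[[Position], float],
                 rng: random.Random) -> List[Empire]:
    """Create npop random countries; keep ep imperialists, the rest become colonies."""
    imperialists: List[Position] = []
    colonies: List[List[Position]] = []
    for i in range(npop):
        country = [rng.uniform(-1.0, 1.0) for _ in range(problem_size)]
        if i < ep:
            # The first ep countries seed the empires.
            imperialists.append(country)
            colonies.append([])
            continue
        worst = max(range(ep), key=lambda k: cost(imperialists[k]))
        if cost(country) < cost(imperialists[worst]):
            # Better than the worst imperialist: take its place (assumption:
            # the replaced imperialist stays on as a colony).
            colonies[worst].append(imperialists[worst])
            imperialists[worst] = country
        else:
            # Otherwise join a randomly chosen empire as a colony.
            colonies[rng.randrange(ep)].append(country)
    return list(zip(imperialists, colonies))


empires = init_empires(npop=20, problem_size=3, ep=4,
                       cost=sphere, rng=random.Random(0))
```

With these choices every imperialist is at least as good as each of its colonies after initialization, which is the starting condition the competition phase relies on.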

Multi-Objective Task Scheduling using K-mean Algorithm in Cloud
Computing (2016): Vanita Dandhwani et al. [8] use the K-mean clustering
algorithm to create clusters of tasks. For k clusters, centroids are
calculated based on the multiple objectives task length and deadline
using equations (1) and (2), and the centroid is determined using
equation (3), where the minimum distance value is selected as centroid.

$T_l = \text{Number of Instructions (MI)}$   (1)

$D_l = VM_{mips} / T_l$   (2)

where $T_l$ = task size, $D_l$ = deadline and $VM_{mips}$ = MIPS of the
average VM.

$\mathrm{dist}((x, y), (a, b)) = \sqrt{(x - a)^2 + (y - b)^2}$   (3)

where x = task size and y = deadline.

Figure 11 below shows the k-means multi-objective task scheduling
algorithm and the resource selection algorithm.

A: K-mean algorithm

Step 1: Select k points as initial centroids.
Step 2: Repeat
Step 3: Form k clusters by assigning each point to its closest centroid.
Step 4: Recompute the centroid for each cluster.
Step 5: Until the centroids do not change.

B: Multi-objective task scheduling algorithm

Step 1: Get a list of unscheduled tasks.
Step 2: Create clusters using the K-mean algorithm.
Step 3: Arrange clusters in descending order (the higher the centroid, the higher the cluster).
Step 4: Map clusters to the best VM using the resource selection algorithm.

C: Resource selection algorithm

Step 1: Input: Get resource list.
Step 2: Begin, i = 0
Step 3: While cluster[i] is not empty do
Step 4: Select the VM which has maximum capacity using equation (4) [12].

        $C_i = P_{num_i} \times P_{mips_i} + VM_{bw_i}$   (4)

        where $P_{num_i}$ is the number of processors in VM_i,
        $P_{mips_i}$ is the millions of instructions per second of all processors in VM_i,
        $VM_{bw_i}$ is the communication bandwidth ability of VM_i.
Step 5: Schedule the cluster and execute it.
Step 6: Update the status of resources.
Step 7: i = i + 1
Step 8: End while
Step 9: End

Figure 11: Algorithm proposed by the author [11]
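
The clustering and capacity steps of Figure 11 can be sketched in plain Python. This is a minimal illustration under assumptions: the first k points seed the centroids (the source does not say how they are chosen), the features are used unscaled, and the capacity formula follows the reconstruction of equation (4) above. The task values are made up.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (task size, deadline)


def dist(p: Point, q: Point) -> float:
    """Euclidean distance of equation (3)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def k_means(points: List[Point], k: int, max_iter: int = 100) -> List[List[Point]]:
    """Cluster points; assumption: the first k points are the initial centroids."""
    centroids = points[:k]
    clusters: List[List[Point]] = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # Step 5: centroids no longer change
            break
        centroids = new_centroids
    return clusters


def vm_capacity(p_num: int, p_mips: float, vm_bw: float) -> float:
    """Capacity of a VM following equation (4)."""
    return p_num * p_mips + vm_bw


points = [(100, 1.0), (5000, 9.0), (120, 1.2), (5200, 8.5), (90, 0.9)]
clusters = k_means(points, k=2)
```

Because task size is numerically much larger than the deadline here, it dominates the distance; scaling the two features first would be a reasonable refinement.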

Multi objective Task Scheduling in Cloud Environment Using Nested PSO
Framework (2015): R K Jena [4] focuses on task scheduling using a
multi-objective nested Particle Swarm Optimization (TSPSO) to optimize
energy and processing time.

Figure 12: Task graph

Figure 12 above represents the task graph. Each vertex V in the DAG is
associated with a value <l>, where l represents the length of the task
in Million Instructions (MI). The problem in this model is how to
optimally schedule user jobs to the Processing Elements (PEs) available
in the cloud under different data centers. All the PEs are considered
homogeneous, unrelated and parallel. Scheduling is considered
non-preemptive, which means that the processing of any task cannot be
interrupted. Figure 13 represents the algorithm proposed by the author
of this paper.


Algorithm MOPSO()
{
    Initialize External Archive (E), E = ∅
    For j = 1 to M              // M is the size of particle swarm
        Initialize S_j          // Initialization of each particle swarm and its velocity
    For k = 1 to L              // L is the number of iterations
    {
        For j = 1 to M
            E[j] = PSO(S_j)     // E[j] is the archive for particle swarm S_j
        Update the archive (E) of non-dominated solutions
        Select leader particle from the archive (E)
        Update velocity
        Update position
    }
    Return (Non-dominated solution)
}

Algorithm PSO(S_j)
{
    // S_j represents the j-th particle allocated to different data centers (D_i), i = 1...P
    For i = 1 to P              // P is the number of available data centers in the Cloud environment

Figure 13: Proposed algorithm by author [4]
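
The "update velocity" and "update position" steps of a PSO loop are usually written with the textbook update rule shown below. This is a generic sketch, not the nested TSPSO of [4]; the inertia weight `w` and the acceleration coefficients `c1`, `c2` are hypothetical values, and `leader` stands for the particle selected from the archive of non-dominated solutions.

```python
import random
from typing import List, Tuple


def pso_step(x: List[float], v: List[float], pbest: List[float],
             leader: List[float], rng: random.Random,
             w: float = 0.7, c1: float = 1.5, c2: float = 1.5
             ) -> Tuple[List[float], List[float]]:
    """One velocity and position update for a single particle."""
    new_v = [
        w * vi                               # inertia: keep part of the old velocity
        + c1 * rng.random() * (pb - xi)      # pull towards the particle's own best
        + c2 * rng.random() * (ld - xi)      # pull towards the archive leader
        for xi, vi, pb, ld in zip(x, v, pbest, leader)
    ]
    new_x = [xi + nvi for xi, nvi in zip(x, new_v)]
    return new_x, new_v


rng = random.Random(0)
position, velocity = pso_step([0.0, 0.0], [0.0, 0.0],
                              pbest=[1.0, 1.0], leader=[2.0, 2.0], rng=rng)
```

A particle that already sits on both its personal best and the leader, with zero velocity, does not move, which is a quick sanity check for the rule.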

Conclusion: 

The literature above summarizes multi-objective task scheduling
algorithms in one form or another. Single-objective functions cannot
fulfil all the criteria: for example, if we consider only the priority
of the task, the rest of the QoS factors that are very important in
scheduling, such as task length, execution time, deadline, cost, etc.,
are left out. So multi-objective task scheduling algorithms are
important for enhancing the performance of the cloud environment.
Table 1 below summarizes the multi-objective task scheduling algorithms
proposed by the various authors.

| Paper name | Authors' name | QoS parameters | Algorithm used | Which parameter is improved |
|---|---|---|---|---|
| Dynamic Multi-objective task scheduling in Cloud Computing based on Modified particle swarm optimization (2015) | A.I. Awad, N.A. El-Hefnawy and H.M. Abdel_kader | Reliability, time, cost and load balancing of virtual machine (VM) | Multi-objective Load Balancing Mutation particle swarm optimization (MLBMPSO) algorithm | Execution time and makespan are minimized |
| Responsive Multi-objective Load Balancing Transformation Using Particle Swarm Optimization in Cloud Environment (2016) | VG. Ravindhren and Dr. S. Ravimaran | Job size | PSO algorithm | Increases throughput, reduces waiting time, considerably reduces missed processes and balances load among the physical machines in a data centre |
| An Efficient Approach for Task Scheduling Based on Multi-Objective Genetic Algorithm in Cloud Computing Environment (2014) | Sourabh Budhiraja, Dr. Dheerendra Singh | Cost and size | Multi-Objective Genetic Algorithm (MOGA) | Minimizes execution time and execution cost |
| Multi-Objective Tasks Scheduling Algorithm for Cloud Computing Throughput Optimization (2015) | Atul Vikas Lakra, Dharmendra Kumar Yadav | Size, cost | Multi-objective task scheduling algorithm | Better performance and improved throughput, reduced cost |
| Multi-Target Tasks Scheduling Algorithm for Cloud-environment Throughput Optimization (2016) | Shubhashree S. R | Size of task | Non-Dominated Sorting algorithm | Enhances the data-center throughput, diminishes the execution time |
| Multi-Objective Task Scheduling in Cloud Computing Using an Imperialist Competitive Algorithm (2016) | Majid Habibi, Nima Jafari Navimipour | Task size | Imperialist Competitive Algorithm | Execution time |
| Multi objective Task Scheduling in Cloud Environment Using Nested PSO Framework | R K Jena | Size of task | Multi-objective nested Particle Swarm Optimization (TSPSO) | Optimizes energy and processing time |
| Multi-Objective Task Scheduling using K-mean Algorithm in Cloud Computing (2016) | Vanita Dandhwani, Dr. Vipul Vekariya | Task length and deadline | Multi-objective task scheduling algorithm using k-mean clustering | Minimizes the execution time and makespan |

Table 1: Summary of survey
Abstract - With the emergence of new developments in solving issues
related to big data, its importance in computer science cannot be
overlooked. In this research paper we briefly describe the motif
occurrence and uniqueness for the significant PCSK9 gene, which is
responsible for the generation of a protein that regulates cholesterol
levels. Motif occurrence is very common in protein sequences, and motif
detection plays a very important role in evaluating the function of
proteins. Currently many motif databases exist which help in comparing
specific motifs with currently available motifs; the results in this
paper are compared with 11 available databases associated with the
TOMTOM tool. Results are calculated using three different clusters of
PCSK9 protein sequences. 48 different species are members of the
clusters used in the analysis, which was performed with the de novo
algorithm used by the MEME suite. The approach can be considered a novel
example of the renowned problem of motif detection in large graphs using
big data analytic techniques.

Keywords-motifs; PCSK9; MEME; transcription; 
bioinformatics; Genomics 

I. Introduction 

In an era where the amount of data has grown enormously, the increasing
size of data is opening up new possibilities for data scientists in
every field, whether it is Mechanics, Bioinformatics, Genomics, Media,
Business, Computer Science, Electronics, Health Sciences,
Telecommunication, etc. [1]. As data keeps growing, it needs to be
analyzed so that the hidden information underneath can be extracted and
useful decisions can be made on its basis. The literature shows that
data scientists have developed many algorithms for data visualization
and analysis to make data more presentable and interpretable for the
sake of information retrieval [2]. Data analysis overlaps with visual
analytics, because visual analysis has a significant place in the world
of Big Data [3]. One of the appropriate techniques for representing data
is in the form of graphs. The nodes in a graph can represent the
prominent entities, depending upon the type of data being processed.
According to the literature, visual graph analysis has gained the
attention of researchers as a way to process various data formats. One
of the challenges of Big Data is that some algorithms provide efficient
results but scale only to limited amounts of data. For massive data
sets, for example protein sequences and DNA sequences, algorithms with
more accuracy are required in order to obtain results. Biologists have
been working on massive datasets using the techniques of Big Data
analytics [4]. In large dynamic graphs, patterns are often repeated,
which shows that some specific path is being followed repeatedly. This
subgraph or repetitive path is known as a “Network Motif” [5]. In
protein sequences these repetitions are found abundantly, so motif
occurrence becomes a factor of similarity. Repetitive patterns in
proteins indicate functional similarity. This similarity holds special
importance because proteins mutate with the passage of time, which
causes various changes in different species. These mutations can be
compared, creating connectivity which identifies the similarity between
protein structures [6]. Random projections are used for the discovery of
motifs with the help of an algorithm named PROJECTION [7]. Besides
PROJECTION, de novo motif finding is currently a prominent approach in
network motif detection; it reports the best motifs found in a sequence
and allows their comparison with other motif databases. Among different
gene sequences the PCSK9 gene is very important, because this gene is
responsible for the generation of a protein which regulates the level of
cholesterol in the bloodstream and thereby affects the rate of cardiac
diseases [8]. The PCSK9 protein helps break down low-density lipoprotein
(LDL) receptors, which remove cholesterol-carrying LDL particles from
the bloodstream.


II. Related Work 

Motif detection is performed using a number of different algorithms and
tools, such as DREME, MAST, GYM, DMINDA 2, etc. The DREME algorithm is
used for the discovery of motifs based on transcription factors (TF). It
allows large data sets to be analyzed and obtains various binding motifs
in sequences. Another program used for motif detection is GYM. GYM is
known for Helix-turn-Helix motif detection. Helix-turn-Helix motifs are
among the most widely studied motif structures as per the literature
[9]; moreover, GYM also provides comprehensive information on the
protein sequences. Some artificial intelligence approaches have also
been used for the discovery of motifs in gene sequences. One such
example is the MAST algorithm, which conveniently uses output from MEME
[10] for searching databases such as SWISS-PROT and GenPept. MAST
provides statistical measures that permit a rigorous evaluation of the
significance of database searches with individual motifs or groups of
motifs [11]. Another facility for the detection of motifs is DMINDA, now
known as DMINDA 2; it is an integrated web server used for the discovery
of motifs in given sequences. The interface of DMINDA 2 also provides
the location of a given motif in a sequence [12]. It provides a suite of
cis-regulatory motif analysis functions on DNA sequences. DMINDA 2
follows four steps for the DNA sequences analysis:


III. Proposed Methodology 

This section describes the methodology for the sequence analysis. Three
clusters of sequences covering 48 different species were downloaded from
the UniProt database. All the sequences were in FASTA format and
unaligned. The sequences were then uploaded to MEME for motif discovery
with specific parameters. The minimum width for motif detection was 6
and the maximum width was 50. The sequence of the analysis is
represented by the diagram below, starting from the outer circle and
moving towards the target. The goal is to find a motif unique in nature.
Identification of a unique motif means that the motif can be studied by
biologists and that it has some specific functionality which is not
present in the existing databases.

Figure 1: Analysis Methodology (stages: obtain clusters of PCSK9; motif
discovery by de novo; finding enriched motifs; comparison with existing
motifs)



IV. Results And Conclusions 

This section presents the results obtained from 3 different clusters of
PCSK9 gene sequences. The MEME suite was used for the discovery,
enrichment analysis and comparison of motifs.

A. De novo motif detection by MEME

The figure shown below represents the de novo motif discovery using the
MM algorithm and the comparison of motifs using the TOMTOM technique
across existing motif databases.

Figure 2: De novo motif detection using MM Algorithm

Figure 3 represents the results calculated by the MAST algorithm used in
the MEME suite.

Figure 3: Initial and ending point of the identified motif using MAST Algorithm

Figure 4: All related motifs in network PCSK9 gene


The figure given below shows the motif with the minimum E-value of
1.1e+001 in the PCSK9 protein sequence. The variation in the size of the
letters representing amino acids in the motif shows conservation.
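
In a sequence logo, taller letter stacks correspond to positions with higher information content. The sketch below shows a simplified version of that calculation, assuming a uniform background over the 20 amino acids and no small-sample correction; the values MEME reports are computed against a background model and will differ. The example sites are hypothetical.

```python
import math
from collections import Counter
from typing import List


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def motif_information(sites: List[str]) -> List[float]:
    """Per-position information content for equal-length motif sites."""
    return [column_information("".join(col)) for col in zip(*sites)]


sites = ["GCP", "GCW"]  # hypothetical 3-residue sites
info = motif_information(sites)
```

The first two positions are fully conserved and reach the maximum of log2(20) bits, while the third position, split between two residues, loses exactly one bit.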


E-value: 1.1e+001   Site Count: 2   Width: 9
Log Likelihood Ratio: 51   Information Content: 32.9   Relative Entropy: 36.6   Bayes Threshold: 9.41574

| Name | Start | p-value | Sites |
|---|---|---|---|
| 1. UniRef50_Q8NBP7_shuf | 369 | 6.30e-12 | CLTNELLLSP T5F GC P LTTGVPGECP |
| 2. UniRef90_A8T666_shuf | 65 | 1.33e-11 | HRALVYYQRL RSG SC W AVPALAFKCT |

Figure 5: Motif with minimum E-value in PCSK9 sequence

B. The Bayes Threshold:

The Bayes threshold of the considered motif is calculated using the
Bayes optimal classifier. Our results showed that the naive classifier
performed efficiently in detecting motifs in sequences of around 400
residues. It also performed well on the control sequences, which were
shuffled versions of the original sequences, in order to observe motif
conservation.
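
A shuffled control keeps the amino-acid composition of a sequence while destroying the residue order, so motifs that survive shuffling are unlikely to be meaningful. The following is a minimal sketch of producing such a control; the function name, the seed and the sample fragment are illustrative assumptions, and the shuffling procedure actually used to create the "_shuf" sequences may differ.

```python
import random


def shuffle_sequence(seq: str, seed: int = 0) -> str:
    """Return a residue-shuffled copy of seq with identical composition."""
    residues = list(seq)
    random.Random(seed).shuffle(residues)  # seeded for reproducibility
    return "".join(residues)


original = "CLTNELLLSPLTTGVPGECP"  # illustrative protein fragment
control = shuffle_sequence(original, seed=42)
```

Fixing the seed makes the control reproducible, which helps when the same comparison has to be rerun.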

Conclusion 

The approach can be considered a novel example of the renowned problem
of motif detection in large graphs using big data analytic techniques.
After discovering motifs in PCSK9 sequences, the next step was
determining the uniqueness and enrichment of the motifs. This paper
describes the motif occurrence and uniqueness for the significant PCSK9
gene, which is responsible for the generation of a protein that
regulates cholesterol levels. The results show that the PCSK9 gene
obtained from 48 different species had the smallest motif with length
49, which represents a significant amount of similarity. Further
comparison with the existing TOMTOM databases showed that only 1 motif
was unique in nature, thus representing the mutation caused by PCSK9. A
few motifs had similarity with other motifs, but their similarity index
was limited as compared to other motifs. Detected motifs can be further
used for finding the binding pockets of the proteins, thus determining
their functionality.


References 


[1] Wernicke, S., & Rasche, F. (2006). FANMOD: a tool for fast network
motif detection. Bioinformatics, 22(9), 1152-1153.

[2] Moses, A. M., Chiang, D. Y., & Eisen, M. B. (2003). Phylogenetic
motif detection by expectation-maximization on evolutionary
mixtures. In Biocomputing 2004 (pp. 324-335).

[3] Vahdatpour, A., Amini, N., & Sarrafzadeh, M. (2009, July). Toward
Unsupervised Activity Discovery Using Multi-Dimensional Motif
Detection in Time Series. In IJCAI (Vol. 9, pp. 1261-1266).

[4] Xing, E. P., Wu, W., Jordan, M. I., & Karp, R. M. (2004). LOGOS:
a modular Bayesian model for de novo motif detection. Journal of
Bioinformatics and Computational Biology, 2(01), 127-154.

[5] Wong, E., Baur, B., Quader, S., & Huang, C. H. (2011). Biological
network motif detection: principles and practice. Briefings in
Bioinformatics, 13(2), 202-215.

[6] Dodd, I. B., & Egan, J. B. (1990). Improved detection of helix-turn-helix
DNA-binding motifs in protein sequences. Nucleic Acids Research,
18(17), 5019-5026.

[7] Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D.,
& Alon, U. (2002). Network motifs: simple building blocks of complex
networks. Science, 298(5594), 824-827.

[8] Dodd, I. B., & Egan, J. B. (1990). Improved detection of helix-turn-helix
DNA-binding motifs in protein sequences. Nucleic Acids Research,
18(17), 5019-5026.

[9] Harrison, S. C., & Aggarwal, A. K. (1990). DNA recognition by proteins
with the helix-turn-helix motif. Annual Review of Biochemistry, 59(1),
933-969.

[10] Bailey, T. L., Boden, M., Buske, F. A., Frith, M., Grant, C. E., Clementi,
L., ... & Noble, W. S. (2009). MEME SUITE: tools for motif discovery
and searching. Nucleic Acids Research, 37(suppl_2), W202-W208.

[11] Bailey, T. L., Baker, M. E., & Elkan, C. P. (1997). An artificial
intelligence approach to motif discovery in protein sequences:
application to steroid dehydrogenases. The Journal of Steroid
Biochemistry and Molecular Biology, 62(1), 29-44.

[12] Ma, Q., Zhang, H., Mao, X., Zhou, C., Liu, B., Chen, X., & Xu, Y.
(2014). DMINDA: an integrated web server for DNA motif
identification and analyses. Nucleic Acids Research, 42(W1), W12-W19.




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 



International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 15, No. 12, December 2017 


A Review on Detection and Counter Measures of Wormhole Attack in Wireless Sensor Network 


Rabia Arshad, Saba Zia 

Abstract- Sensor nodes are organized to form a wireless sensor network (WSN) that can be deployed in hostile environments. Sensor nodes 
communicate with each other using routing protocols. Information from a source node to a destination node is sent via intermediate nodes. 
Security is a major issue in WSNs at present, as a WSN is vulnerable to attacks that can damage the functionality 
of the system. In this survey paper an attempt has been made to analyze security threats to WSNs at the network layer. The network 
layer is affected by many attacks, e.g. Black Hole Attack, Grey Hole Attack, and Wormhole Attack, of which the Wormhole 
attack is the most devastating: attacker agents create a low-latency link between two points. This paper focuses 
on research into detecting and preventing the wormhole attack at the network layer. 

Keywords- Mobile Adhoc Network, Security, Wireless Sensor Network, Wormhole Attacks 

I. Introduction 

A WSN [1-5] is composed of numerous sensor nodes that are capable of monitoring environmental conditions. Sensor 
nodes are responsible for transmitting information in the network. Transmitting data in a WSN is a critical task 
because sensor nodes are resource-constrained devices. For this reason, a sensor network is susceptible to many attacks. WSNs 
have some special features that distinguish them from other networks. These characteristics are as follows [18]: 

• Limited resources 

• Minimum battery life span 

• Self-configuration 

• Random changes in topology of network 

• Centralized approach of network control 

Security is one of the important challenges in designing a WSN. Data is vulnerable to attacks of many kinds; therefore, 
security measures should be taken while designing the WSN. Many security attacks can affect the performance 
of a WSN, e.g. Black hole [19], Grey hole [20], and Wormhole attacks. The wormhole attack is the most dangerous attack on 
a WSN. From this point of view, this paper briefly describes the techniques for detection and prevention of the wormhole 
attack. The rest of the paper is organized as follows: 

• Section II describes the challenges of WSN. 

• Section III summarizes various attacks on WSN. 

• Section IV lists network layer attacks and their effects. 

• Section V covers the background of the wormhole attack. 

• Section VI describes the different types of wormhole attack. 

• Section VII discusses the different modes of wormhole attack. 

• Section VIII lists some countermeasures to the wormhole attack. 

II. Challenges of WSN 

According to different application requirements, the following design objectives of sensor nodes are considered [16, 17]: 

A. Low Cost Node 

Sensor nodes are usually deployed in large quantities in a harsh environment. Also, the sensor nodes are not 
reusable. Therefore, reducing the cost of the sensor nodes is an important step in network design. 

B. Small Node Size 

Reducing the size of the sensor node reduces both the power consumption and the cost of sensor nodes. 
A small node size is also very useful for node deployment in a hostile environment. 

C. Low Energy Consumption 

Sensor nodes consume power in performing their function. Power in the sensor nodes is provided by the batteries. 
In some situations it is impossible to charge their batteries. Therefore reducing the power usage of sensor nodes is a 
crucial factor. In this way, the network lifetime can be prolonged. 

D. Scalability 

Routing protocols for sensor networks must be scalable to different network sizes, as a sensor network may consist 
of thousands of sensor nodes. 

E. Reliability 

Protocols for sensor networks must include error detection and correction techniques. By these techniques, a 
protocol ensures the reliability of data delivery over some noisy link. 

F. Adaptability 

In sensor networks, any fault can occur in the network due to which a node may fail. A new node may be added 
in the network at some later stage. It is also possible that a node may move to some new place in the network (in the 
mobile network). These situations result in variations in the network topology. Therefore, the network protocols should 
be adaptive to such changes in the network. 

G. Channel Utilization 

The network protocols should efficiently use the bandwidth to improve utilization of the channel as the sensor 
networks have limited bandwidth resources. 

H. Fault Tolerance 

A WSN is mostly deployed in a harsh or hostile environment. Nodes might fail due to the harsh 
environmental conditions. Therefore, sensor nodes must be fault tolerant. 

I. Security 

Information in the network must be secured and protected from malicious attacks. Thus, effective security 
techniques must be introduced in the sensor network to avoid these kinds of attacks. 

III. Attacks on WSNs 

WSNs are susceptible to a variety of attacks because of multi-hop transmission. Because nodes are deployed in hostile 
environments, WSNs have some additional vulnerability to attacks. Table I summarizes the possible attacks 
on different layers of WSN and their possible solutions. 


TABLE I. Attacks on Different Layers of WSN and Their Solutions 

| Layer | Attacks | Security Solutions |
|---|---|---|
| Physical | Tampering | Tamper Proofing |
| | | Hiding |
| | | Encryption |
| Data Link | Jamming | Error correction method |
| | Collision | Spread Spectrum Method |
| Network | Sybil | Authentication |
| | Sinkhole | Authorization |
| | Wormhole | |
| Transport | Packet Injection | Packet Authentication |
| Application | Aggregation Based Attacks | Cryptographic Approach |


IV. Network Layer Attacks and Their Effects 

The layered architecture of a WSN makes it more vulnerable to security attacks. Various attacks and their defensive 
techniques have been proposed for WSNs. Attacks on the network layer [6] are given in Table II. 


TABLE II. Network Layer Attacks in WSN and Their Effects 

| Attack | Description | Effects |
|---|---|---|
| Wormhole | a. Requires two or more adversaries. b. These adversaries have better communication resources between them [7]. | Network topology changes. Packets are destroyed. False routing information. |
| Sybil | A malicious node presents different identities and attracts the traffic [8]. | Disruption of WSN. Can be a source for other attacks. |
| Black Hole | A malicious node behaves like the destination node and does not forward the packet. | Throughput is decreased [7]. Disruption of WSN. |
| Sink Hole | More complex than Black Hole [7]. | Attracts all the traffic. Other attacks are also triggered. Base station position is affected. |
| Selective Forwarding | A malicious node selectively drops packets and does not forward them. It acts like a Black Hole [9]. | Message contents are modified. Packet dropping. Resources can be exhausted. |
| False Routing | The attacker creates loops in the network by routing packets to a false sink node [10]. | False messages. Resource exhaustion. |

V. Background and Significance of Wormhole Attack 
A WSN is vulnerable to attacks of different types due to the scarcity of resources. The wormhole attack is a severe attack 
on the network layer of a WSN in which two or more attacking agents are connected by a high-speed off-channel wormhole 
link [12]. The wormhole attack has two different modes, i.e. ‘Hidden’ and ‘Exposed’. In the exposed mode 
of attack, the identity of the attacker is attached to the packet header while tunneling and replaying packets [11]. 

In a wormhole attack, two attackers form a tunnel to transfer data and replay this data in the network. This tunnel 
is referred to as a wormhole. The wormhole attack affects the WSN tremendously. Routing may be disrupted 
when routing messages are tunneled. Figure 1 represents a scenario of a wormhole attack. Packets received at node A 
are replayed via node B and vice versa. 



VI. Classification of Wormhole Attack 

A wormhole attack is very difficult to detect. It is one of the Denial of Service attacks. A wormhole attack might be 
launched by any number of nodes. Wormhole attacks must be categorized in order to detect and prevent 
them. Wormholes are divided into three categories [13], i.e. Open, Closed and Half Open. Table III 
describes which nodes are visible or invisible in the three categories. Figures 2, 3 and 4 describe the three 
types of wormhole attack respectively, where S = source, D = destination, M = malicious node. The classification of 
wormhole attacks is based on the following: 

i. Whether attackers are invisible or visible 

ii. Data forwarding mechanism of wormhole nodes 

iii. Ability to hide and show the identities of nodes 


TABLE III. Nodes Description in Different Wormhole Attacks 

| | Open | Half Open | Closed |
|---|---|---|---|
| Source Node | Visible | Visible | Visible |
| Destination Node | Visible | Visible | Visible |
| Malicious Node 1 | Visible | Visible | Invisible |
| Malicious Node 2 | Visible | Invisible | Invisible |


A. Open Wormhole 

The source node, destination node and malicious nodes are visible in an open wormhole. Nodes X and Y are kept hidden 
on the traversing path. In this mode, the packet header also contains the attackers. All sensor nodes in the network are aware 
of the presence of the malicious nodes and represent them as direct neighbors. 



B. Closed Wormhole 

In this mode, the source node and destination node behave as if they were one hop away from each other, which leads to the creation 
of fake neighbors. 



C. Half Open Wormhole 

In this mode of wormhole attack, the first malicious end is visible near the source node and the second end is hidden. 
The contents of the data packet are not modified by the attackers. Packets are rebroadcast in this type of attack because 
the attackers tunnel the packets from one malicious end to the other. 

VII. Modes of Wormhole Attacks 

Wormhole attacks are categorized into the following modes based on how many nodes are involved in establishing the 
wormhole. 

A. Wormhole Using Protocol Distortion 

This type of wormhole does not affect network functioning very much, so it is a relatively harmless type. It is also known 
as a “Rushing Attack”. Only a single malicious node distorts the routing protocol. 

B. Wormhole using Packet Relay 

In this type of attack, malicious nodes replay data between two nodes that are far from each other. This 
leads to the creation of fake neighbors. It is also known as a “Replay Based Attack”. 

C. Wormhole using Out-of-Band Channel 

In this type of attack, there is only one malicious node, which has a high transmission capability. It attracts data packets to 
traverse the route that passes through the malicious node. 

D. Wormhole using Packet Encapsulation 

In this type of attack, there are many nodes present between any two malicious nodes. Data is encapsulated 
between the malicious nodes. 

Table IV summarizes the different modes of wormhole attack and gives the minimum number of adversary nodes for each [13]: 


TABLE IV. Summary of Wormhole Attack Modes 

| Mode | Minimum Adversary Nodes |
|---|---|
| Protocol Distortion | 1 |
| Packet Relay | 1 |
| Out of Band Channel | 2 |
| Packet Encapsulation | 2 |


VIII. Counter Measures of Wormhole Attack 

In this section, some important wormhole detection methods are discussed. Table V summarizes a description of 
wormhole detection methods. 

A. Statistical Analysis Method 

Song et al. proposed a mechanism for detecting wormholes based on statistical analysis of multipath routing. 
This method is useful for multipath and on-demand routing protocols in intrusion detection systems [14]. 

B. Graph Theory Method 

The graph theory method was proposed by Lazos and Poovendran [11] for detection of wormholes. In this method, 
the nodes used have location information and are thus called location-aware guard nodes (LAGNs). LAGNs 
use a key called the “local key” between neighbors that are one hop apart. A message encrypted with the local key 
cannot be decrypted by nodes that do not hold that key. Therefore, hashed messages are used in forming the local key to detect wormholes. A node will 
detect a number of inconsistencies in the message if there is a wormhole; otherwise it will be unable to respond. 

C. Hop Counting Method 

Hidden as well as exposed attacks can be detected using this method. In this method, the DELPHI [14] (Delay per 
Hop Indicator) protocol is used. The length of each route and the time delay of each route are calculated to identify a 
wormhole. A route containing a wormhole shows a greater time delay per hop than other routes. 
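
The delay-per-hop comparison described above can be sketched as follows. This is a minimal illustration, not the DELPHI protocol itself: the function names, the use of the median as a baseline, and the `ratio_threshold` parameter are assumptions introduced here for clarity.

```python
from statistics import median

def delay_per_hop(round_trip_delay_ms, hop_count):
    """Average delay contributed by each advertised hop on a route."""
    if hop_count <= 0:
        raise ValueError("hop_count must be positive")
    return round_trip_delay_ms / hop_count

def suspicious_routes(routes, ratio_threshold=2.0):
    """Return the names of routes whose delay per hop is far above the median.

    routes: dict mapping route name -> (round_trip_delay_ms, hop_count).
    ratio_threshold: hypothetical tuning knob, not taken from the paper.
    """
    per_hop = {name: delay_per_hop(d, h) for name, (d, h) in routes.items()}
    baseline = median(per_hop.values())
    return sorted(n for n, v in per_hop.items() if v > ratio_threshold * baseline)

# Example data (made up): r3 advertises few hops but has a long delay,
# which is the pattern a tunneled route would show.
routes = {
    "r1": (40.0, 4),  # 10 ms per hop
    "r2": (55.0, 5),  # 11 ms per hop
    "r3": (90.0, 3),  # 30 ms per hop
}
print(suspicious_routes(routes))
```

In this sketch a route is flagged only relative to the other routes observed, so at least a few wormhole-free routes are needed for the baseline to be meaningful.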

D. Visualization Based Method 

In this method [11], each node computes the distance to its neighboring nodes using received signal strength. 
From these calculations, the network topology is computed by the base station. The network topology is more or less 
flat if there are no wormholes. If wormholes are present, the visualization shows what looks like a string 
pulling different ends of the network together. In this method, each node has to send its list of neighbors to the base station. 
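
The first step above, estimating the distance to a neighbor from received signal strength, can be illustrated with a short sketch. The source does not say which propagation model is used; the log-distance path loss model and all default parameter values below are assumptions chosen only for illustration.

```python
def distance_from_rss(rss_dbm, rss_at_ref_dbm=-40.0, ref_distance_m=1.0,
                      path_loss_exponent=2.0):
    """Estimate distance (metres) from received signal strength (dBm).

    Assumes the log-distance model:
        rss(d) = rss_at_ref - 10 * n * log10(d / ref_distance)
    The default reference power and exponent are hypothetical values.
    """
    exponent = (rss_at_ref_dbm - rss_dbm) / (10.0 * path_loss_exponent)
    return ref_distance_m * 10 ** exponent

# A node would build one estimate per neighbor and report the list
# to the base station, which reconstructs the topology.
neighbor_rss = {"n1": -40.0, "n2": -60.0}
estimates = {node: distance_from_rss(rss) for node, rss in neighbor_rss.items()}
print(estimates)
```

With these assumed parameters a 20 dB drop in signal strength corresponds to a tenfold increase in estimated distance.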

E. Hardware Based Method 

A method based on directional antennas [15] was proposed by Hu and Evans. It is based on the assumption that, in the 
absence of any wormhole, if node A sends data in a particular direction, then its neighbor receives it from the opposite 
direction. Every node must therefore be equipped with directional antennas. 
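
The opposite-direction check described above can be sketched as follows. The sketch assumes that each node divides its surroundings into a fixed number of antenna zones and that all nodes number their zones with the same orientation; the zone count of six and the function names are assumptions made for illustration, not details given in this review.

```python
NUM_ZONES = 6  # assumed number of antenna zones per node

def opposite_zone(zone, num_zones=NUM_ZONES):
    """Zone directly opposite to `zone` when zones are numbered 0..num_zones-1."""
    return (zone + num_zones // 2) % num_zones

def is_consistent_neighbor(send_zone, receive_zone, num_zones=NUM_ZONES):
    """True if the receiver heard the packet in the zone opposite to the
    zone the sender transmitted in, as expected for a true neighbor."""
    return receive_zone == opposite_zone(send_zone, num_zones)

# A sends in zone 1; a genuine neighbor should hear it in zone 4.
print(is_consistent_neighbor(1, 4))  # True
print(is_consistent_neighbor(1, 2))  # False: direction mismatch, possible wormhole
```

A mismatch does not prove that a wormhole exists; in this sketch it only marks the claimed neighbor as one that should not be accepted without further verification.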

F. Trust Based Method 

Jain and Jain proposed another important method [14] for identifying and isolating malicious nodes that create 
a wormhole in the network. In this method, trust values are calculated according to the sincerity with which neighboring 
nodes execute the routing protocol. The trust value is used to influence the routing decisions, guiding the 
nodes not to communicate through wormholes. Packet dropping is reduced to 14% by using the trust based method. 
Using trust based mechanisms, throughput is also increased by up to 8-9%. 
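
A trust-guided routing decision of the kind described above might look like the sketch below. The review does not give the trust formula, so the forwarding ratio used here is only an assumed stand-in for a node's "sincerity", and the rule that a route is as trustworthy as its weakest node is likewise an assumption.

```python
def trust_value(packets_forwarded, packets_received):
    """Assumed trust metric: fraction of received packets that a node forwarded."""
    if packets_received == 0:
        return 0.0  # no evidence yet, treat the node as untrusted
    return packets_forwarded / packets_received

def route_trust(route, trust):
    """A route is treated as only as trustworthy as its least trusted node."""
    return min(trust[node] for node in route)

def select_route(candidate_routes, trust):
    """Pick the candidate route with the highest route trust."""
    return max(candidate_routes, key=lambda r: route_trust(r, trust))

# Hypothetical observations: node M drops most of the traffic it receives.
trust = {
    "A": trust_value(95, 100),
    "B": trust_value(90, 100),
    "M": trust_value(10, 100),
}
candidates = [["A", "M"], ["A", "B"]]
print(select_route(candidates, trust))  # ['A', 'B']
```

Taking the minimum over the route means a single low-trust node is enough to steer traffic away from that path.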


TABLE V. Summary of Methods for Wormhole Detection 

| Methods | Description |
|---|---|
| Statistical Analysis Method | Efficient for on demand protocols, easy integration and effective for multipath routing |
| Graph Theory Method | Nodes are equipped with GPS receivers |
| Hop Counting Method | Low overhead, high efficiency and fast performance |
| Visualization Based Method | Mobility is not considered in this method. It is best for dense networks |
| Hardware Based Method | Not applicable to networks without directional antennas. Very efficient for networks having antennas |
| Trust Based Method | Locates dependable routes in the network in an effective way |


IX. Conclusion 

In this paper, attacks on WSN are studied briefly. The wormhole attack is one of the important attacks on WSN. 
The background of the wormhole attack, its causes and methods of prevention, i.e. the graph theory method, trust based method, 
and hop-counting method etc., are discussed in this paper. The graph theory method requires GPS receivers, as it is a location 
based method. The hardware method is efficient for large networks having antennas. The hop-counting method 
has high efficiency compared to other methods. 

Abstract - Indoor optical wireless communication systems 
offer an attractive alternative for realizing next-generation 
wireless local area networks. In the context of indoor 
optical wireless communication (OWC) systems, the spot- 
diffusing technique provides improved performance 
compared to the conventional diffuse system. In this work, the 
performance of an OW spot-diffusing communication 
system using a Neuro-Fuzzy (NF) adaptive multi-beam 
transmitter configuration is studied. This 
research focuses on the performance of indoor 
optical wireless systems, where the BER of the existing 
modulation technique is evaluated based on the Spot 
Diffusion Adaptive ANFIS algorithm and compared with 
existing results. A linear error-correcting code, the 
low-density parity-check (LDPC) code, is introduced in 
the algorithm to observe the change in the performance of 
the system. The analysis shows that involving LDPC in the 
Spot Diffusion Adaptive ANFIS algorithm provides 
approximately 37.5% improvement in BER against 
receiver position, with the BER decreasing progressively. From 
the bit error rate analysis of these schemes, it is seen that 
the system model with LDPC performs better than other 
existing techniques. 

Keywords: Indoor optical wireless communication (OWC), 
Neuro-Fuzzy (NF) adaptive multi-beam transmitter, Spot 
Diffusion Adaptive ANFIS algorithm, linear error- 
correcting code, low-density parity-check (LDPC). 

I. INTRODUCTION 

This work focuses on the performance analysis of the 
BER of an indoor optical wireless communication system using 
LDPC. Some advantages of OWC are low cost, baseband 
circuit design, high data rates (Gbps), less multi-access 
interference, and no need to pay for a spectrum license. 
Its limitations are that it cannot pass through walls, is sensitive 
to blocking, and has limited transmit power. These research 
works have their own features, possibilities and limitations, 
which are described later in the paper. However, wireless 
systems have some working and performance limitations. 
Some key points are discussed below: 

Optical fiber communication, which transmits information through 
optical fibers, has largely replaced radio transmitter systems 
for long-haul data transmission. Such systems are used not only 
for telephony but also for Internet traffic, long high-speed local 
area networks (LANs), and cable TV (CATV). 
OWC is defined as the use of optical frequencies, i.e. 
unguided visible, infrared (IR), or ultraviolet (UV) light, 
to carry a signal. 

For indoor communication, infrared provides 
significant advantages as a medium, but it also has some 
drawbacks. Several aspects impair the performance of 
indoor IR transmission systems. Because of such hindrances, 
design and implementation using infrared are not easy. So, 
optical wireless communication is used to overcome 
these difficulties in the indoor environment. 


Table 1.1 below gives a comparison between optical 
wireless and radio frequency systems. 

| Criteria | Optical wireless systems | Radio systems |
|---|---|---|
| Bandwidth | Unregulated, large | Limited |
| Passes through wall | No | Yes |
| Cost | Low | High |
| Speed | High | Low |
| Fading | Free from fading | Multipath fading |
| Security | Security and freedom from spectrum regulation and licensing | Low security |

Table 1.1: Comparison Between Optical Wireless and 
Radio Systems in Indoor Wireless Communication. 

OWC can be used over different ranges: 

• Short range (cm - m): Chip-to-Chip Interface, 

• Medium range (m - 10 m): Wireless Optical LAN, 

• Long range (km): Free-Space Optical Communications. 

In an optical wireless LAN, the light can be used to 
illuminate the room while serving as a medium for data 
transfer, and also for transferring data at high speed over long 
distances. For free-space communication, it also offers high- 
speed long-distance data transfer using satellites. 

1. Eye safety considerations put a limit on the amount of 
optical power that may be emitted by the 
transmitter, thus limiting the coverage of an optical 
wireless system. 

2. In indoor optical wireless systems, the leading source of 
noise is ambient light, which is typically a combination 
of fluorescent light, sunlight, and incandescent light. 
Ambient light provokes shot noise due to the random 
nature of the photo-detection process. 

3. A multipath phenomenon occurs when the transmitted 
signal follows different paths on its way to the receiver 
due to its reflection by walls, ceilings and other objects. 
Channel dispersion associated with multipath 
propagation is another major issue in indoor optical 
wireless systems. Multipath phenomena can cause 
inter-symbol-interference (ISI). 
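
The multipath dispersion in item 3 is commonly quantified by the RMS delay spread of the received paths. The sketch below shows that standard calculation; the function name and the example path delays and weights are assumptions for illustration and are not measurements from this work.

```python
import math

def rms_delay_spread(delays_ns, weights):
    """Weighted RMS spread of path delays.

    delays_ns: arrival delay of each path in nanoseconds.
    weights:   non-negative weight of each path (for example the received
               power, or the squared impulse-response sample, of that path).
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    mean_delay = sum(t * w for t, w in zip(delays_ns, weights)) / total
    variance = sum((t - mean_delay) ** 2 * w
                   for t, w in zip(delays_ns, weights)) / total
    return math.sqrt(variance)

# Hypothetical example: a direct path and one equally strong reflection
# arriving 10 ns later.
print(rms_delay_spread([0.0, 10.0], [1.0, 1.0]))
```

A larger delay spread relative to the symbol period means neighboring symbols overlap more at the receiver, which is how multipath propagation leads to ISI.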

While evaluating the performance of optical wireless 
systems in the indoor context, some problems were noted. 
In particular, the evaluation of BER is a major factor for such 
systems. 

Another technique, LDPC, has been used for long-haul 
data transmission using the principle of forward error 
correction in optical wireless communication. In another 
work, a Neuro-Fuzzy technique was used to analyze the BER 
according to the mobility of the transmitter and receiver, but 
the BER was not investigated with respect to the distance 
between transmitter and receiver. 

Thus, investigating the BER over the allowable distance 
between transmitter and receiver in the indoor context of optical 
wireless communication is the major issue of this proposed 
work, where the reason for using the LDPC technique is to 
improve the BER performance over previous work and 
studies. 

• Previous work on indoor optical wireless communication 
has mainly analyzed BER. All the information related to 
this proposed research is investigated and simulated on 
the Matlab platform. 

• If the measurements had been made in a real 
environment, more accurate results might have been 
obtained. 

• Only the upward light illumination measurement is 
considered here. Measurements of side-wall reflections 
in a room are not included. 

• Five parameters are considered in the proposed 
research: SNIR (Signal-to-Noise-and-Interference Ratio), 
delay spread, and BER with and without 
LDPC. More parameters could have 
been considered. 

The objectives of the study are: 

• to review existing improvements of BER in indoor optical 
wireless communication systems 

• to propose a method such as LDPC in this area that examines 
and analyzes the BER of the system 

• to compare the performance of the proposed method with 
existing research 

Together, these aims provide a sufficiently thorough analysis of 
BER for high-speed indoor OW applications. 

In chapter 2, a number of different proposed works are 
discussed based on research papers. Proposed system 
model has been discussed in chapter 3. Numerical analysis 
and comparison with other methods have been established 
in chapter 4. 

II. LITERATURE REVIEW 

This chapter gives brief background on OWC 
and BER. Many studies have applied these methods 
to different wireless communication systems over time. Here, 
several studies are discussed and compared with each 
other. 

Today’s research is mostly directed toward 
wireless communication. The high maintenance 
and configuration cost of wired systems makes wireless 
systems an economical and flexible alternative. A 
proliferation of research and development has been carried 
out in this area, specifically in optical wireless 
communication. Next-generation wireless 
communication systems already rely on optical 
wireless system technologies. Conventionally, OWC 
technology depends on high-power solid-state lasers or 
diode lasers for medium- to long-range applications. Recently, 
there have been remarkable advances in semiconductor 
sources such as light emitting diodes (LEDs) at visible 
and ultraviolet wavelengths, multi-array light sources and 
detectors, and tracking and steering. With the benefits of low 
power and cost for short/medium-range wireless 
communication applications, such advances offer huge 
potential. Optical wireless communication could provide a 
cost-effective, flexible, secure and ultra-high-speed solution 
to the emerging challenges facing system and 
service providers, compared with RF (radio frequency). 

Optical wireless communication has become an 
effective alternative medium to optical fiber and radio 
frequency (RF) communications, and it is gradually 
removing the hindrances that 
challenged previous communication technologies 
because of its high bandwidth, low cost, ease of 
implementation, license-free spectrum, freedom from 
interference and more. Some wireless applications, 
such as 3-D face-to-face communication and Super 
Hi-Vision/Ultra High Definition TV data (more than 4 Gbit/s) [1], 
can be envisioned through optical wireless communication 
because of their explosive bandwidth demand. Data throughput 
and the optical wireless transmission link are 
of great concern for a number of applications [2] [3]. 

Most of the time, transmitted data is not fully secure or 
error free. Statistical fluctuations related to noise 
influences (e.g. laser noise, amplifier noise, shot noise, or 
excess noise of a receiver) cause a small fraction of the 
transmitted bits to be defective. Typically, the bit error rate 
(i.e., the fraction of incorrectly transmitted bits) is strongly 
dependent on the transmitted power, and the latter must be 
high enough to keep the bit error rate below a certain 
acceptable level (e.g. 10^-12 for Earth-based 
telecommunication systems, or 10^-6 for satellite control). 
Nearly all of the remaining bit errors can be detected using 
some kind of checksum and corrected. 



Fig 2.1: Simple Scenario of BER in Optical Wireless 
Communication (axis label: Bit Rate, Mbps) 

The error correction scheme can use some level of 
redundancy in the transmitted data or involve retransmitting 
corrupted data packets. Additional detrimental influences 
such as fiber losses or various types of dispersion in a 
longer link, or background light in free-space transmission, 
can often be compensated by somewhat increasing the 
transmitted power. BER is the ratio of the number of erroneous 
bits to the total number of transmitted bits. 
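
As a small illustration of this definition, the sketch below counts the positions where the received bits differ from the transmitted bits and divides by the number of transmitted bits. The function name and the example bit sequences are assumptions made for illustration only.

```python
def bit_error_rate(sent_bits, received_bits):
    """Fraction of transmitted bits that were received incorrectly."""
    if len(sent_bits) != len(received_bits):
        raise ValueError("sequences must have the same length")
    if not sent_bits:
        raise ValueError("at least one bit is required")
    errors = sum(s != r for s, r in zip(sent_bits, received_bits))
    return errors / len(sent_bits)

sent     = [1, 0, 1, 1, 0, 0, 1, 0]
received = [1, 0, 0, 1, 0, 1, 1, 0]  # bits 3 and 6 are flipped
print(bit_error_rate(sent, received))  # 0.25
```

In practice, far longer sequences are needed to estimate very low error rates, since observing a rate such as one error in a million bits requires transmitting many millions of bits.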

The increase of optical power required to maintain a given 
bit error rate is called a power penalty, or more specifically 
e.g. a dispersion penalty if dispersion is the considered 
factor. Figure 2.1 above presents a simple scenario of BER 
in OWC. 

To reduce the probability of information loss, the low- 
density parity-check (LDPC) code is an effective error- 
correcting code used in noisy communication channels. 
By employing LDPC, this probability can be made as 
small as desired, so that the data transmission rate can come 
close to Shannon’s limit. 
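
The parity-check idea behind such codes can be shown with a very small linear block code. The sketch below uses the (7,4) Hamming code purely as a stand-in: it is not an LDPC code, since practical LDPC codes use much larger, sparse parity-check matrices and iterative decoding. The matrix `H` and the function names are assumptions chosen only to show how a parity-check matrix detects and locates a bit error.

```python
# Parity-check matrix of the (7,4) Hamming code. Column j (1-based) is the
# binary representation of j, least significant bit in the first row.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(word):
    """Compute H * word (mod 2); all zeros means every parity check passes."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]

def correct_single_error(word):
    """Flip the bit indicated by the syndrome, if any (single-error case)."""
    s = syndrome(word)
    position = s[0] + 2 * s[1] + 4 * s[2]  # 1-based error position, 0 if none
    fixed = list(word)
    if position:
        fixed[position - 1] ^= 1
    return fixed

codeword = [1, 1, 1, 0, 0, 0, 0]   # satisfies all three parity checks
noisy = [1, 1, 1, 0, 1, 0, 0]      # fifth bit flipped by the channel
print(syndrome(noisy))             # [1, 0, 1]
print(correct_single_error(noisy)) # [1, 1, 1, 0, 0, 0, 0]
```

An LDPC decoder follows the same principle of checking parity equations, but resolves errors iteratively across many sparse checks instead of reading the error position directly from the syndrome.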

Optical wireless communication is a dynamic area of 
research and development. The basic scheme of OWC 
systems and their applications are discussed 
in [4]. Recent advancements in OW communication are 
reviewed and summarized in [5], with the main focus on 
indoor deployment scenarios. 

Possible configurations, modulation schemes, and multi-access 
techniques for future high-performance and cost-effective 
indoor optical wireless systems are presented 
in [6]. 
An overview of OWC systems operating in both the short-range 
(personal and indoor systems) and the long-range (outdoor and 
hybrid) regimes is given in [7], within the context of point-to-point 
ultra-high-speed data transfer. The discussion on outdoor 
systems focuses on the impact of atmospheric effects on 
the optical channel and associated mitigation techniques 
that extend the realizable link lengths and transfer rates [7]. 
The authors of [5] discussed generic propagation 
models for the design of wireless optical communication. 
The bit error rate of optical wireless communication was 
addressed in [8] by employing a detector array in the presence of 
turbulence. Extensive work has been done to 
improve the performance of OWC systems in terms of BER by 
employing several methods. The articles [9] [10] [11] discuss 
the BER of wireless optical communication, 
presenting different methods and approaches. 
The artificial neural network (ANN) has some attractive 
properties such as adaptability, parallel processing, and 
universal approximation. A well-trained ANN can perform 
equalization efficiently. In [12], the authors 
presented an ANN for equalization of an optical channel in 
an indoor environment. In another work, the 
authors described the improvement of 
the optical wireless system using Neuro-Fuzzy based spot- 
diffusing techniques, which provide better performance 
than other generic spot-beam diffusion methods [13]. 
In optical wireless communications, the LDPC method 
is also an effective way to improve the BER (bit error rate). 

Many works have focused on LDPC for optical 
wireless communication. The performance of regular and 
irregular, random-like and 
structured generalized low-density parity-check (GLDPC) 
codes for long-haul transmission was investigated in [14]. The 
authors of [15] evaluated low-density parity-check (LDPC) coded 
schemes with unequal transmission power allocation 
(UTPA) in the optical wireless channel. A method of 
establishing a secure and reliable 
communication link using optical wireless communication 
(OWC) was proposed in [16]. A modified low-density 
parity-check (LDPC) codec algorithm was implemented in an 
ultraviolet (UV) communication system in [17]. In [18], the authors 
analyzed the BER of an optical wireless communication (OWC) system 
employing a Neuro-Fuzzy (NF) based spot diffusion system, 
but that system has not yet been evaluated for 
BER performance using LDPC. In recent years, 
research on LDPC for indoor optical wireless 
communication has been increasing. 

In comparison, the proposed work introduces 
LDPC methods, focusing on improving the bit error 
rate for more secure and reliable data transmission 
in indoor optical wireless communication. 

III. SYSTEM MODEL & PROPOSED METHOD 

In this chapter, an established Spot Beam algorithm using 
LDPC is proposed. Here, SNIR, delay spread, and BER 
are considered. This research builds on the 
existing Spot Beam Selection algorithm, using LDPC 
methods for high-speed data transfer and to reduce the 
probability of information loss. 

An Indoor Optical Wireless System: RF (Radio 
Frequency) and infrared are the two major transmission 
technologies for indoor wireless communication 
[19]. Performance depends on the propagation and type 
of system used. The basic system types can be 
categorized into diffuse and line of sight (LOS) systems [6], 
[7]. High data rates, on the order of Gbit/s, can be 
achieved in LOS systems [8], [11], but due to its 
directionality such a system is vulnerable to 
blockage/shadowing. In a diffuse OW system, several paths 
from source to receiver exist, which makes the 
system robust to blockage/shadowing. However, the path 
losses are high and the multiple paths create inter-symbol 
interference (ISI), which limits the achievable data rate [8], 
[11]. 

Transmission Techniques: A basic optical wireless system 
consists of a transmitter (using LEDs or LDs), a propagation 
medium such as free space, and a receiver (using APDs or PIN 
diodes). Some transmission techniques are described 
below: 

Directed beam infrared (DBIR) radiation: From the 
transmitter to the receiver the optical beam travels directly 
without any reflection. The optical wireless link using this 
technique is established between two fixed data terminals 
with highly directional transmitter and receiver at both ends 
of the link. Lack of mobility and susceptibility to blocking and
shadowing by personnel and machines are the main
drawbacks of this technique.

Diffuse infrared (DFIR) radiation: In DFIR, the transmitters
send optical signals over a wide angle to the ceiling, and the
signals arrive at the receivers after one or several
reflections. The system does not require any line of sight
alignment for transmission, which makes it one of the most
desirable configurations from a user point of view. However,
systems using this technique have a higher path loss than
their DBIR counterparts, requiring higher transmitter power
levels and receivers with a larger light collection area.
Multipath dispersion is another challenging problem in this
technique. The transmitters radiate optical power over a wide
solid angle and thus provide mobility to the receiver.

Quasi-diffuse infrared (QDIR) radiation: In QDIR, there is a
base station (BS) with relatively broad coverage, made of a
passive or active reflector that is usually mounted on the
ceiling. The BS transmits (receives) the signal power to
(from) the remote terminals (RTs) while always maintaining
the line of sight; thus, the RTs cannot be fully mobile. To
enable communication between itself and the BS from any
position in the room, the RT's transceiver must be aimed at
the BS, or its FOV must be wide enough. In another form of
the QDIR technique, the transmitter may send the optical
signal to a designated area on the ceiling and the receiver is
supposed to face that area. In general, this architecture
provides a compromise between the DFIR and DBIR options.
A QDIR system inherits aspects of both point-to-point and
diffuse links: slowly diverging beam sources illuminate a
grid of spots on the ceiling.

Channel propagation: A number of transmission
techniques are possible for indoor optical wireless systems;
these techniques may be classified according to the degree
of directionality of the transmitter and receiver [19]. For
indoor OW links, the two most common configurations are LOS
and non-LOS propagation systems. In indoor mobile
applications, LOS links rely upon a direct path between the
transmitter and receiver, regardless of their beam angles,
while non-LOS links generally rely upon light reflections
from walls, ceilings and other diffuse reflecting surfaces.
A non-directed non-LOS link scenario, often referred to as a
diffuse link, is the most desirable approach. Diffuse systems
also play a very significant role in environments where
shadowing exists. One of the main reasons is that diffuse
propagation systems do not require transmitter-receiver
alignment or a line of sight, and instead make use of
reflections from walls, ceilings and other reflectors.
However, diffuse systems are subject to multipath dispersion,
which results in signal spread and inter-symbol interference.
The effects of multipath propagation can be reduced through
the use of diversity and/or equalization.

System Model: Here we consider an empty room with floor
dimensions of 8 m × 4 m and a ceiling height of 3 m. The
reflection coefficient of the ceiling is taken to be 0.8.
The ceiling has eight spot lights. In the figure, x is the
position of the imaging receiver and v is its velocity,
α is the elevation angle, δ is the azimuth angle, d = 8,
w = 4 and h = 3. A Neuro-Fuzzy (NF) adaptive multibeam
transmitter is located at the center of the room, whereas an
imaging receiver is placed at (1; 1; 0.5). The transmitter
generates a multi-spot beam matrix on the ceiling with
adapted beam angle (α, δ) and power, which is reflected to
the imaging receiver. Through the low rate diffuse channel,
the transmitter learns the receiver position and mobility.
At low data rate, the beam maintains fixed power.

Delay Spread: The Doppler spread of an impulse is
expressed as an rms value.

Analysis of SNIR and BER: Ambient light affects the
signal-to-noise-plus-interference ratio (SNIR) at the receiver
in indoor optical wireless communication. Many researchers
have considered intensity modulation with direct detection
(IM/DD) as the most feasible approach. The received
signal, denoted by $y(t)$, can be expressed as:

$$y(t) = R\,x(t) \otimes h(t, \alpha, \delta) + n(t, \alpha, \delta) + I(t, \alpha, \delta) \qquad (1)$$

where $R$ is the receiver responsivity, $x(t)$ is the
instantaneous optical transmitted power, $h(t, \alpha, \delta)$ is the
impulse response of the OW channel, $n(t, \alpha, \delta)$ is the
ambient light noise, and $I(t, \alpha, \delta)$ is the instantaneous
interference power. The SNIR, denoted by $\gamma$, of the received
signal can be calculated by [9]:

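$$\gamma = \frac{R^2 (P_{s1} - P_{s0})^2}{(\sigma_{s1} + \sigma_{s0})^2} \qquad (2)$$

where $P_{s1}$ and $P_{s0}$ are the optical powers associated with
binary 1 and binary 0 respectively, and $\sigma_{s1}$ and $\sigma_{s0}$ are the
shot noise variation components associated with $P_{s1}$ and $P_{s0}$
respectively.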
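Bit Error Rate: For the non-encoded system with binary
phase-shift keying (BPSK), the BER expression can be
given by:

$$P_{bpsk}(\gamma) = \frac{1}{\pi} \int_0^{\pi/2} \exp\left(-\frac{b_{bpsk}\,\gamma}{\sin^2\theta}\right) d\theta \qquad (3)$$

where $b_{bpsk} = \sin^2(\pi/2)$. By (2) and (3) we can write

$$P_{bpsk}(\gamma) = \frac{1}{\pi} \int_0^{\pi/2} \exp\left(-\frac{b_{bpsk}\,R^2 (P_{s1} - P_{s0})^2}{(\sigma_{s1} + \sigma_{s0})^2 \sin^2\theta}\right) d\theta \qquad (4)$$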
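As a numerical check of the integral form of the BPSK bit error rate, the following Python sketch evaluates $\frac{1}{\pi}\int_0^{\pi/2}\exp(-\gamma/\sin^2\theta)\,d\theta$ with the midpoint rule and compares it with the well-known closed form $\frac{1}{2}\,\mathrm{erfc}(\sqrt{\gamma})$. It assumes $b_{bpsk} = 1$ and an SNIR given as a linear ratio (not in dB); the function names and the number of integration steps are illustrative choices, not taken from the paper.

```python
import math


def ber_bpsk_integral(snir, steps=2000):
    """Midpoint-rule estimate of (1/pi) * int_0^{pi/2} exp(-snir / sin^2(theta)) d(theta)."""
    h = (math.pi / 2) / steps
    total = 0.0
    for k in range(steps):
        theta = (k + 0.5) * h          # midpoint of the k-th sub-interval
        total += math.exp(-snir / math.sin(theta) ** 2)
    return total * h / math.pi


def ber_bpsk_closed_form(snir):
    """Closed-form BPSK bit error rate over an AWGN channel: 0.5 * erfc(sqrt(snir))."""
    return 0.5 * math.erfc(math.sqrt(snir))


if __name__ == "__main__":
    for snir in (0.5, 2.0, 5.0):       # hypothetical linear SNIR values
        print(snir, ber_bpsk_integral(snir), ber_bpsk_closed_form(snir))
```

Both functions should agree closely, and the BER falls as the SNIR grows, which is the trend the numerical section relies on.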


Adaptive Power Allocation: The achievable data
transmission rate, denoted by $b$, of the OWC system is
given by

$$b = \log_2\left(1 + \frac{R^2 (P_{s1j} - P_{s0j})^2}{(\sigma_{s1j} + \sigma_{s0j})^2}\right) \qquad (5)$$

The optimization problem and the constraint of the power
allocation can be written as

$$\max\; b \qquad (6)$$

$$\text{s.t.} \quad \sum_{j=1}^{J} P_j \le P \qquad (7)$$

where $P$ is the average power. To analyze the above
optimization problem, we can use the Lagrange multiplier
method, and the Lagrangian function is defined as

$$L = b + \mu_j \left( \sum_{j=1}^{J} P_j - P \right) \qquad (8)$$

where $\mu_j$ is the Lagrange multiplier. After solving Eqn.
(8), we can write

[...] (11)

where $\mu$ = [...], $t_i$ is the delay time and $P_r$ is the
received power. The Doppler shift and the Neuro-Fuzzy ANFIS
model are then considered, using the Spot Beam selection
algorithm of this paper.
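To illustrate how an rms delay spread can be obtained from delay times and received powers, the Python sketch below assumes a definition that weights each ray by its squared received power: $\mu = \sum_i t_i P_i^2 / \sum_i P_i^2$ and $\sigma = \sqrt{\sum_i (t_i - \mu)^2 P_i^2 / \sum_i P_i^2}$. The weighting, the function name and the sample values are assumptions made for illustration; other definitions weight by $P_i$ instead.

```python
import math


def rms_delay_spread(delays, powers):
    """rms delay spread of a set of rays (assumed weighting: squared received power)."""
    weights = [p * p for p in powers]
    total = sum(weights)
    mean_delay = sum(t * w for t, w in zip(delays, weights)) / total
    variance = sum((t - mean_delay) ** 2 * w for t, w in zip(delays, weights)) / total
    return math.sqrt(variance)


if __name__ == "__main__":
    # Two equal-power rays arriving 2 ns apart (hypothetical values, in ns).
    print(rms_delay_spread([0.0, 2.0], [1.0, 1.0]))
```

A single ray gives zero spread, and the spread grows as power is distributed over rays with more widely separated arrival times.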

Proposed System Method: Since the aim of this proposed
work is to transmit correct information over a longer
distance with lower transmit power, an LDPC coding scheme is
used with coded modulation to achieve significant coding gain
without bandwidth expansion.

The system model with LDPC can be explained as follows.
Information, typically in the form of digital data, is input
to electronic circuitry that modulates the transmitting light
source (LEDs/LDs). The source output passes through an
optical system into free space (the propagation medium).
The optical signal traverses the indoor channel, where it is
affected by multipath propagation and background noise. At
the receiver, the signal likewise passes through an optical
system, and the noise-affected optical signal is detected by
the photodetector.



Fig. 3.1: Block Diagram for the LED based Indoor 
Communication System Model using LDPC coding with 
BPSK Modulation 

Then BPSK demodulation takes place followed by LDPC 
decoding. The BER analysis is done to estimate the 
performance of the system. 

Error Correction Coding Scheme - LDPC: LDPC codes
are based on a parity-check matrix, where each transmitted
message is an (m, n) binary block with $2^n$ binary m-tuples
(m > n). Here m is the block length, n is the number of data
bits and (m - n) is the number of check bits. The code rate
is $r = n/m$.

Encoding: The parity-check matrix, H, for the (7, 4) LDPC
block code is constructed using Hamming codes as constituent
codes so that the position where an error occurs can be
detected for correction.

$$H = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 1 & 1 & 1 \end{bmatrix} \qquad (12)$$







With the associated data bits of parity locations, the
generator matrix, G, is generated and given below:

$$G = \begin{bmatrix} 1 & 1 & 1 & 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix} \qquad (13)$$


Given a message u, the codeword c will be the product of G
and u:

$$c = G \cdot u \qquad (14)$$

If the received codeword is v, the syndrome vector, denoted
by z, will be

$$z = H \cdot v \qquad (15)$$

When z = 0, the codeword is error free. Otherwise, the value
of z gives the position of the flipped bit.
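The encoding and syndrome steps can be sketched in a few lines of Python. The sketch uses the (7, 4) matrices H and G, treats the message u as a length-4 row vector so that the codeword is u·G (mod 2), and reads the syndrome as a binary number with the first row of H as the least significant bit. The orientation of the product and the function names are assumptions made for illustration.

```python
# Parity-check and generator matrices of the (7, 4) code.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]
G = [
    [1, 1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0],
    [0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1],
]


def encode(u):
    """Codeword c = u * G (mod 2) for a 4-bit message u."""
    return [sum(u[k] * G[k][i] for k in range(4)) % 2 for i in range(7)]


def syndrome(v):
    """Syndrome z = H * v (mod 2) of a received 7-bit word v."""
    return [sum(H[r][i] * v[i] for i in range(7)) % 2 for r in range(3)]


def correct(v):
    """Flip the single bit whose 1-based position is encoded by the syndrome."""
    z = syndrome(v)
    position = z[0] + 2 * z[1] + 4 * z[2]   # 0 means no error detected
    fixed = list(v)
    if position:
        fixed[position - 1] ^= 1
    return fixed


if __name__ == "__main__":
    c = encode([1, 0, 1, 1])
    received = list(c)
    received[4] ^= 1                         # flip the fifth bit
    print(c, syndrome(received), correct(received))
```

Because the columns of H are the binary representations of the positions 1 to 7, any single flipped bit is located directly by the syndrome.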

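Decoding: Decoding is accomplished by sending messages
through the edges of the factor graph that connect variable
nodes $b_i \in B$ and check nodes $f_j \in F$, where $B$ and $F$ are the
sets of all message nodes and check nodes respectively. Here
assume that $m_{b_i f_j}$ is the message sent by node $b_i$ to node
$f_j$, and $m_{f_j b_i}$ vice-versa. Let the probability that there is
a 1 at each variable node be $a_{b_i}$, where $i = 1, 2, 3, \dots, n$.
Now, the probability that there is an even number of 1's on
the variable nodes is

$$P_{even} = \Pr[b_1 \oplus \dots \oplus b_n = 0] = \frac{1}{2}\left[1 + \prod_{i=1}^{n} (1 - 2a_{b_i})\right] \qquad (16)$$

Now, $m_{f_j b_i}(0)$ can be expressed as

$$m_{f_j b_i}(0) = \frac{1}{2}\left[1 + \prod_{b_{i'} \in B_j \setminus b_i} \left(1 - 2\,m_{b_{i'} f_j}(1)\right)\right] \qquad (17)$$

where $B_j$ is the set of variable nodes connected to $f_j$.
Therefore,

$$m_{f_j b_i}(1) = 1 - m_{f_j b_i}(0) \qquad (18)$$

The variable nodes update their response messages to the
check nodes using

$$m_{b_i f_j}(0) = k_{b_i f_j}\,(1 - P_i) \prod_{f_{j'} \in F_i \setminus f_j} m_{f_{j'} b_i}(0) \qquad (19)$$

and

$$m_{b_i f_j}(1) = k_{b_i f_j}\,P_i \prod_{f_{j'} \in F_i \setminus f_j} m_{f_{j'} b_i}(1) \qquad (20)$$

where $F_i$ is the set of check nodes connected to $b_i$. Here
$k_{b_i f_j}$ is the constant that ensures
$m_{b_i f_j}(0) + m_{b_i f_j}(1) = 1$, and

$$P_i = \Pr[b_i = 1 \mid v] \qquad (21)$$

At this point, the $b_i$ nodes update their decision, denoted by
$\hat{b}_i$, by calculating the probabilities $M_{b_i}(0)$ and $M_{b_i}(1)$ for
0 and 1 respectively:

$$M_{b_i}(0) = k_{b_i}\,(1 - P_i) \prod_{f_j \in F_i} m_{f_j b_i}(0) \quad \text{and} \quad M_{b_i}(1) = k_{b_i}\,P_i \prod_{f_j \in F_i} m_{f_j b_i}(1)$$

Therefore, the decision, denoted by $\hat{b}_i$, is given below:

$$\hat{b}_i = \begin{cases} 1 & \text{if } M_{b_i}(1) > M_{b_i}(0) \\ 0 & \text{otherwise} \end{cases}$$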
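The message-passing decoder described above can be sketched in Python for the (7, 4) parity-check matrix. This is a minimal sketch under stated assumptions, not the authors' MATLAB code: the channel is taken to be a binary symmetric channel with a hypothetical crossover probability `p_flip`, so that the prior $P_i$ is `1 - p_flip` for a received 1 and `p_flip` for a received 0, and the function name and iteration limit are illustrative.

```python
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def sum_product_decode(v, p_flip=0.1, max_iter=20):
    """Probability-domain message passing on the factor graph of H."""
    n = len(v)
    checks = [[i for i in range(n) if row[i]] for row in H]
    # Prior P_i = Pr[b_i = 1 | v_i] for the assumed binary symmetric channel.
    prior = [1 - p_flip if bit else p_flip for bit in v]
    # q[(j, i)]: message from variable i to check j (probability of a 1).
    q = {(j, i): prior[i] for j, c in enumerate(checks) for i in c}
    decision = list(v)
    for _ in range(max_iter):
        # Check-to-variable messages: probability that bit i is 0.
        r0 = {}
        for j, c in enumerate(checks):
            for i in c:
                prod = 1.0
                for k in c:
                    if k != i:
                        prod *= 1 - 2 * q[(j, k)]
                r0[(j, i)] = 0.5 + 0.5 * prod
        for i in range(n):
            nbrs = [j for j, c in enumerate(checks) if i in c]
            # Tentative decision from the prior and all incoming messages.
            m0, m1 = 1 - prior[i], prior[i]
            for j in nbrs:
                m0 *= r0[(j, i)]
                m1 *= 1 - r0[(j, i)]
            decision[i] = 1 if m1 > m0 else 0
            # Variable-to-check messages exclude the target check.
            for j in nbrs:
                a0, a1 = 1 - prior[i], prior[i]
                for j2 in nbrs:
                    if j2 != j:
                        a0 *= r0[(j2, i)]
                        a1 *= 1 - r0[(j2, i)]
                q[(j, i)] = a1 / (a0 + a1)   # normalised so both values sum to 1
        # Stop as soon as every parity check is satisfied.
        if all(sum(decision[i] for i in c) % 2 == 0 for c in checks):
            break
    return decision


if __name__ == "__main__":
    print(sum_product_decode([1, 0, 0, 0, 0, 0, 0]))
```

The loop stops either when all parity checks are satisfied or after `max_iter` iterations. Note that a short code such as this one has cycles in its factor graph, so the computed probabilities are only approximate.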


The algorithm terminates if $\hat{b}_i$ satisfies the parity-check
equations; otherwise it terminates after a predetermined
number of iterations. The algorithm for correct data
transmission is summarized as follows:

Step 1: A spot beam scans the ceiling; the SNIR and delay
spread for each beam are calculated by the imaging receiver
using Eqns (2) and (9).
Step 2: Based on the required minimum SNIR, i.e., $\gamma_{min}$, and
maximum delay spread, i.e., $\sigma_{max}$, the transmitter selects the
spot-beam matrix (H) through the NF controller.
Step 3: The transmitter allocates the power for each selected
beam adaptively using Eqn (7).
Step 4: Based on the Doppler shift, the transmitter adapts
the beam angles α and δ.
Step 5: The multi-spot optical transmitter further reduces
the [...] by scheduling.
Step 6: Finally, the multi-spot optical transmitter transmits
the spot beam matrix to the receiver via the ceiling.
Step 7: When the transmitter obtains the receiver's current
position, it transmits data using LDPC; the receiver then
decodes the data using LDPC decoding.
Step 8: Go to Step 1 if the transmitter gets an update of the
receiver's position.

Algorithm: The following algorithm will find the spot beam
with an equal power allocation over a 40 × 20 beam hologram
or matrix, H.

Notation:

γ = SNIR, σ = delay spread, δ = azimuth angle and
α = elevation angle, T = Transmitter, R = Receiver.

P = position of receiver.

Initialization;

Pro1: begin

    for each beam i = 1 to n do
        Calculate γ and σ;

        Set γmin and σmax;

        if γ <= γmin || σ >= σmax then
            Calculate α and δ;
        end

        Calculate P using ANFIS Controller;

        T sends encoded data c using LDPC to R located at
        position P;

        R receives c' and decodes it;
    end
end

if P of R changes then
    Call Pro1
end
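The threshold test in the loop above can be sketched in Python as follows. This is one possible reading of the pseudocode, not the authors' implementation: beams whose SNIR is at or below the minimum, or whose delay spread is at or above the maximum, are set aside for beam-angle adaptation, and the remaining beams are accepted. The dictionary keys, function name and sample values are hypothetical.

```python
def split_beams(beams, snir_min, delay_max):
    """Return (accepted, to_adapt) according to the SNIR and delay-spread thresholds."""
    accepted, to_adapt = [], []
    for beam in beams:
        if beam["snir"] <= snir_min or beam["delay_spread"] >= delay_max:
            to_adapt.append(beam)    # beam angles would be recalculated for these
        else:
            accepted.append(beam)
    return accepted, to_adapt


if __name__ == "__main__":
    # Hypothetical per-beam measurements reported by the receiver.
    beams = [
        {"id": 1, "snir": 18.0, "delay_spread": 0.4},
        {"id": 2, "snir": 9.0, "delay_spread": 0.3},
        {"id": 3, "snir": 20.0, "delay_spread": 1.5},
    ]
    accepted, to_adapt = split_beams(beams, snir_min=10.0, delay_max=1.0)
    print([b["id"] for b in accepted], [b["id"] for b in to_adapt])
```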


IV. NUMERICAL ANALYSIS 

The proposed method builds on the implemented Neuro-Fuzzy
based multi beam system (NFMS) with a diversity receiver
configuration. It has also been compared with other
spot-beam diffusion methods. The research has been done by
adding LDPC coding to the method investigated above. As a
result, the performance of indoor optical wireless
communication improves, with a lower error rate for data
transmission.

Result and Discussion: The following simulation parameters
were used to analyze the implemented algorithm. The length,
width and height of the room are 8 m, 4 m and 3 m, and the
reflection coefficient of the ceiling is ρ = 0.8. There is
one transmitter, located at (2; 4; 1), and one receiver. The
area and acceptance semi-angle of each photodiode are 2 cm²
and 65° respectively. The number of pixels at the receiver
is 200 (each with an area of 0.01 cm²). Pedestrians
typically move at a speed of 1 m/s, and the SNIR is computed
after 10 μs. There are 8 spot lamps in the room, located at
(1; 1; 1), (1; 3; 1), (1; 5; 1), (1; 7; 1), (3; 1; 1),
(3; 3; 1), (3; 5; 1), and (3; 7; 1), and the wavelength of
the light is 850 nm. The results of the Adaptive Spot Beam
Selection with the ANFIS model were obtained using these
parameters.

In this section, the numerical results are analyzed using the
mathematical equations derived in the previous section and
are simulated using code written in MATLAB. The
performance analysis is based on a scenario for a typical
indoor environment in the presence of additive white
Gaussian noise. The performance of the system depends
on the receiver sensitivity. The 80 ms adaptation time
gives an overhead of 8%. The adaptation time depends on the
environment. The receiver computes the SNIR and delay
spread and sends this information via a low rate channel to
the transmitter. ANFIS considers two inputs. The ANFIS is
trained iteratively to achieve the desired output: after a
predefined simulation time, the simulation results are
obtained and used for training.



Figure 4.1: Effect of receiver position on SNIR distribution
without the Adaptive Neuro-Fuzzy System


The corresponding result for the adaptive method is
shown in the figure below:



Figure 4.2: Effect of receiver position on SNIR distribution
for the Adaptive ANFIS Method and LSMS



Fig 4.4: SNIR and BER Using BPSK Modulation 

This figure shows the SNIR and BER performance using the
BPSK modulation technique. It shows how the BER changes in
response to the SNIR: the BER decreases as the SNIR
increases.



Fig 4.6: BER for Corresponding Receiver position 

Figure 4.6 shows that using the LDPC coding scheme gives
better performance. The graph compares the performance of
the non-coded system with that of the coded (LDPC) system.
In this comparison the BER consistently increases, while
LDPC gives the comparatively good result that the proposed
method was expected to deliver.


This figure compares the proposed Adaptive Neuro-Fuzzy
method with the line strip multi-spot diffuse system (LSMS).
The BER calculations were performed for the receiver moving
towards the transmitter (i.e., the value of x is increasing)
while neglecting the movement along the y axis. Significant
BER improvement is observed as the NFC moves the spot beam,
selects only the best positioned spots, and allocates the
power adaptively based on the channel condition of the
selected spots. It is also found that the BER performance
degrades as the receiver moves. The BER increases as
the velocity of the receiver increases.



Fig 4.3: Effect of receiver position in Delay Spread 

This figure shows the BER comparison of the proposed model
with LSMS and the conventional diffuse system. The
performance evaluation reveals that the BER improves if the
spot beam angle is changed adaptively.


The above simulation and analysis therefore show that the
Adaptive Spot Diffusion Beam Selection algorithm, when
coded with LDPC, gives better BER performance.

VII. CONCLUSION AND RECOMMENDATIONS 

It is possible to conclude that, in spite of the advances
achieved so far, there is still a lot of work to be done to
fully exploit the advantages and the potential offered
by the optical medium. For indoor wireless system
applications, optical communication offers an
important alternative for the growing area of mobile
computers and communication. Techniques to improve
the operation and speed of infrared wireless systems in
room environments have still to be found, while keeping
the cost of the systems as low as possible.
Researchers and manufacturers are also trying to find ways
to improve the data bit rates and the range offered by
current systems. This work evaluates the performance of an
LDPC-coded indoor optical wireless communication
system, with the aim of lowering the bit error rate, which
can be used for secure and reliable communication where
quality of service is the main concern. Simulation results
indicate that BER performance is improved by
incorporating LDPC coding. This work can be extended by
considering the mobility and BER of multiple transmitters
and receivers using this coding scheme.

Abstract— This research paper explores risk management from a risk mitigation perspective. It examines
mitigation processes and strategies for mitigating both negative and positive risks, and discusses the essence of
cost-benefit analysis for each recommended control as a core mitigation process in order to select a
cost-effective and best-fit control for the organization and stakeholders. It further emphasises the importance of
integrating and updating risk registers in the process, and of consistent communication/consultation among
stakeholders, including the Original Equipment Manufacturers (OEMs).

Keywords — Risk, Risk Management, Risk Mitigation, cost-benefit analysis, Vulnerability, Opportunity 


I. Introduction 

In the face of ever-changing global threats and attacks, every organization is adopting measures to reduce negative risks
and exploit positive risks. This ensures that its vision and mission are protected, guarded and fully enhanced. This is
critical as organizations make ICT the hub for better support and sustenance of their business. As organizations automate
their processes using data and communication devices, risk management plays a very critical role in protecting the
organization's information assets, and therefore its mission. Risk management is every stakeholder's duty and not only
that of the technical IT team. It should therefore be treated as an essential function of management. An
effective and efficient risk management process is an important component of successful ICT security, ensuring
data confidentiality, integrity and high availability. According to the Project Management Institute (PMI) (2015), risk is an
uncertain event or condition that, if it occurs, has a positive or negative effect on one or more project objectives such as
scope, schedule, cost, and quality. A risk may have one or more causes and, if it occurs, it may have one or more impacts.
Risk management is the process of identifying risk, assessing risk, and taking steps to reduce risk to an acceptable level
or, if possible, eradicate it completely. Risk should be understood as both the net negative effect of the exercise of a
vulnerability (weakness), according to Stoneburner et al. (2002), and the net positive effect of harnessing an additional
opportunity (strength) in the devices in order to fully maximize their utilization and functionality.

II. Risk mitigation process 

This step involves the process of weighing and developing options and actions to enhance opportunities and reduce threats to
the devices. Risk mitigation, as the second process of risk management, involves prioritizing, evaluating, and
implementing the appropriate risk-reducing or risk-enhancing controls recommended from the risk assessment process.
Because the elimination of all risk is usually impractical or close to impossible, it is the responsibility of principal
stakeholders to use the least-cost approach and implement the most appropriate controls to decrease mission risk to an
acceptable level, with minimal adverse impact on the organization's resources and mission. This section covers risk
mitigation options, an approach for control implementation, control categories, the cost-benefit analysis used to justify
the implementation of the recommended controls, residual risk and, of course, the updating of the appropriate risk registers.



Figure 1: Risk Mitigation Process (adapted from stoneburner et al) 

Step 1: Prioritize Actions 

Based on the risk levels presented in the risk assessment report and the content of the risk register, the recommended
implementation actions are prioritized. In allocating resources, top priority should be given to risk items with unacceptably
high risk rankings (e.g., risks assigned a Very High or High risk level). These vulnerability/threat pairs will require
immediate corrective action to protect an organization's interest and mission.
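The prioritization in Step 1 can be sketched in Python as a sort of the risk register by risk level. The register layout, the four level names and the function names are assumptions made for illustration; an organization's own register may use different fields and scales.

```python
# Assumed ordering of risk levels, from most to least urgent.
LEVEL_ORDER = {"Very High": 0, "High": 1, "Medium": 2, "Low": 3}


def prioritize(register):
    """Return the register entries sorted so the highest risk levels come first."""
    return sorted(register, key=lambda item: LEVEL_ORDER[item["level"]])


def immediate_action(register):
    """Entries that call for immediate corrective action (Very High or High)."""
    return [item for item in prioritize(register) if item["level"] in ("Very High", "High")]


if __name__ == "__main__":
    # Hypothetical risk register entries.
    register = [
        {"id": "R1", "level": "Low"},
        {"id": "R2", "level": "Very High"},
        {"id": "R3", "level": "High"},
    ]
    print([item["id"] for item in prioritize(register)])
```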

Step 2: Evaluate Recommended Control Options 

The controls recommended in the risk assessment process may not be the most appropriate and feasible options for a 
specific organization and IT system. During this step, the feasibility (e.g., compatibility, user acceptance) and 
effectiveness (e.g., degree of protection and level of risk mitigation) of the recommended control options are analyzed. 
The objective is to select the most appropriate control option for minimizing risk. 

Step 3: Conduct Cost-Benefit Analysis 

According to Wethli, K. (2014), cost-benefit analysis provides one means of identifying the cases in which specific
interventions to manage risk do appear to be cost-effective.
To aid management and principal stakeholders in decision-making and to identify cost-effective controls, a cost-benefit
analysis is conducted. The objectives and method of conducting the cost-benefit analysis are discussed in detail later
in this work.

Step 4: Select Control 

On the basis of the results of the cost-benefit analysis, management determines the most cost-effective control(s) for 
reducing risk to the organization’s mission. The controls selected should combine technical, operational, and 
management control elements to ensure adequate security for the IT system and the organization. 

Step 5: Assign Responsibility 

Appropriate stakeholders (in-house personnel or external contracting staff) who have the expertise and skill
sets to implement the selected control are identified, and responsibility is assigned by email alert. However, a basic user
awareness campaign must be carried out company-wide to educate users on what to do and how to do it, including
sharing the in-house escalation matrix. All these are to ensure total proactivity in mitigating risks.

Step 6: Develop a Safeguard Implementation Plan 

During this step, a safeguard implementation plan (or action plan) is developed. The plan should, at a minimum, 
contain the following information: 

a) Risks (vulnerability/threat pairs) and associated risk levels (output from risk assessment report) 

b) Recommended controls (output from risk assessment report) 

c) Prioritized actions (with priority given to items with Very High and High risk levels) 

d) Selected planned controls (determined on the basis of feasibility, effectiveness, benefits to the organization, and 
cost) 

e) Required resources for implementing the selected planned controls 

f) Lists of responsible teams and staff 

g) Start date for implementation 

h) Target completion date for implementation 

i) Maintenance requirements. 

The safeguard implementation plan prioritizes the implementation actions and projects the start and target completion 
dates. This plan will aid and expedite the risk mitigation process. 
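The minimum content of the safeguard implementation plan, items a) to i) above, can be captured as a simple record. The Python sketch below is illustrative only: the class name, field names and sample values are hypothetical, and a real plan would normally live in the organization's own tooling.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class SafeguardPlanEntry:
    risk: str                        # a) vulnerability/threat pair
    risk_level: str                  # a) associated risk level
    recommended_controls: list       # b)
    priority: int                    # c) 1 = highest priority
    selected_controls: list          # d)
    required_resources: list         # e)
    responsible: list                # f) teams and staff
    start_date: date                 # g)
    target_completion_date: date     # h)
    maintenance_requirements: str = ""   # i)

    def planned_duration_days(self):
        """Number of days between the start and target completion dates."""
        return (self.target_completion_date - self.start_date).days


if __name__ == "__main__":
    # Hypothetical entry for a weak-password risk.
    entry = SafeguardPlanEntry(
        risk="Weak user passwords / password guessing",
        risk_level="High",
        recommended_controls=["Enforce complex passwords"],
        priority=1,
        selected_controls=["Password complexity policy"],
        required_resources=["System administrator time"],
        responsible=["IT security team"],
        start_date=date(2018, 1, 1),
        target_completion_date=date(2018, 1, 31),
        maintenance_requirements="Quarterly policy review",
    )
    print(entry.planned_duration_days())
```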

Step 7: Implement Selected Control(s) 

Depending on individual situations, the implemented controls may lower the risk level but may not eliminate the risk
completely. This leaves residual risks, which are kept under close watch and monitoring in the appropriate risk
registers.

Control Categories 

In implementing recommended controls to mitigate risk, an organization should consider technical, management, and 
operational security controls, or a combination of such controls, to maximize the effectiveness of controls for their IT 
systems and organization. Security controls, when used appropriately, can prevent, limit, or deter threat-source damage to 
an organization’s mission. The control recommendation process will involve choosing among a combination of technical, 
management, and operational controls for improving the organization’s security posture. The trade-offs that an 
organization will have to consider are illustrated by viewing the decisions involved in enforcing use of complex user 
passwords to minimize password guessing and cracking. In this case, a technical control requiring add-on security 
software may be more complex and expensive than a procedural control, but the technical control is likely to be more 
effective because the enforcement is automated by the system.
Residual Risks and Updating of Risk Register

As the final output of the risk mitigation process, organizations can analyze the extent of the risk reduction generated by the
new or enhanced (recommended) controls in terms of the reduced threat likelihood of occurrence or impact, the two
parameters that define the mitigated level of risk to the organizational mission. They should also update the risk register,
either by adding the risks that have the least impact when all are compared and/or by de-listing those that have been
triggered, considered of high impact and subsequently mitigated.
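The effect of a control on the two parameters, likelihood and impact, can be sketched in Python. The sketch assumes a simple multiplicative risk score (likelihood × impact) and expresses the effect of a control as fractional reductions; the scoring model, function names and numbers are assumptions made for illustration and are not prescribed by the process described here.

```python
def risk_level(likelihood, impact):
    """Assumed risk score: probability of occurrence times impact."""
    return likelihood * impact


def residual_risk(likelihood, impact, likelihood_reduction=0.0, impact_reduction=0.0):
    """Risk score remaining after a control reduces likelihood and/or impact."""
    new_likelihood = likelihood * (1 - likelihood_reduction)
    new_impact = impact * (1 - impact_reduction)
    return risk_level(new_likelihood, new_impact)


if __name__ == "__main__":
    # Hypothetical risk: 50% likelihood, impact of 100 units; the control halves the likelihood.
    print(risk_level(0.5, 100), residual_risk(0.5, 100, likelihood_reduction=0.5))
```

The difference between the two values is the risk reduction attributable to the control, and the second value is what remains to be tracked in the risk register.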

Implementation of new or enhanced controls can mitigate risk by: 

a) Eliminating some of the system’s vulnerabilities (flaws and weakness), thereby reducing the number of possible 
threat-source/vulnerability pairs or outright prevention of exercise of the vulnerabilities 

b) Adding a targeted control to reduce the capacity and motivation of a threat-source. For example, a department
determines that the cost of installing and maintaining add-on security software for the stand-alone PC that
stores its sensitive files is not justifiable, but that administrative and physical controls should be implemented to
make physical access to that PC more difficult (e.g., store the PC in a locked room, with the key kept by the
manager).

c) Reducing the magnitude of the adverse impact (for example, limiting the extent of vulnerability or modifying 
the nature of the relationship between the IT system and the organization’s mission). 

III. RISK MITIGATION OPTIONS FOR NEGATIVE AND POSITIVE RISKS RESPECTIVELY

These are the various options through which threats can be reduced and, where possible, eradicated (in the case of negative
risks). Positive risks, or opportunities, can instead be enhanced or invested in. The mitigation options are therefore
treated separately for these two sides of a risk.

Strategies for Negative Risks or threats: There are three main strategies used to deal with threats that, if they occur, may
lead to compromise of data/information integrity and confidentiality by exploiting vulnerabilities in the devices, together
with acceptance, which applies to both negative and positive risks:

a) Risk Avoidance: This is used where the risk impact is high. The stakeholders act to eliminate the threats. The
most radical avoidance strategy is to shut down the devices or disconnect them from the network. This may prompt
the stakeholders to consult the manufacturers for an immediate solution, if there is no other alternative.

b) Risk Transfer: Here, the stakeholders shift the impact of the threat, and ownership of the
responsibility, to a third party by use of insurance, warranties, guarantees etc.

c) Risk Mitigation: In this strategy, stakeholders act early to reduce the probability of occurrence or the impact of a risk,
thereby bringing the risk within an acceptable threshold.

d) Risk Acceptance: This is used for negative and positive risks. In this scenario, stakeholders decide to
acknowledge the risks and take no action unless the risk occurs. However, this strategy provides room for
periodic reviews of the threats to ensure that the risk does not change significantly. This also applies to the
risks under close monitoring, such as those in the risk registers.

Strategies for Positive Risks or Opportunities: 

a) Exploit: This is used for risks with positive impacts on the devices where the stakeholders wish to ensure the
opportunity is realized. It seeks to eliminate the uncertainty associated with a particular upside risk by
ensuring the opportunity definitely happens. For example, engaging a highly experienced expert to administer the
devices, who ensures that the full potential of all the devices is utilized and also embraces new technology trends,
including upgrades, in order to proactively minimize any vulnerability and negative risks.
b) Enhance: This is used to increase the probability and/or positive impacts of an opportunity. Identifying and
maximizing the key drivers of a positive-impact risk may increase the probability of its occurrence. For
example, changing or upgrading the software (operating systems, applications etc.) and hardware of a device
can increase its throughput and security.

c) Share: Sharing a positive risk involves allocating some or all of the ownership of the opportunity to a third 
party who is best able to capture the opportunity for the benefit of the stakeholders. For example, forming 
a risk-sharing partnership, teams or joint ventures can be established with express purpose of taking 
advantage of the opportunity so that all stakeholders gain from their actions. 

d) Accept: Accepting an opportunity is being willing to take advantage of the opportunity if it arises but not
actively pursuing it.

The goals and mission of an organization should be considered in selecting any of the options. It may not be
practical to address all identified risks (low, medium and high), so priority should be given to the risks which are adjudged to
have the potential to cause significant mission impact or harm to the stakeholders and their organizations. Therefore, the
best-of-breed approach is to use appropriate technologies from various vendor security products, along with the appropriate
risk mitigation option and other administrative measures which are best practices.

IV. COST-BENEFIT ANALYSIS 

To allocate resources and implement cost-effective controls, organizations, after identifying all possible controls and
evaluating their feasibility and effectiveness, should conduct a cost-benefit analysis for each recommended control to
determine which controls are required and appropriate for their circumstances. A high cost does not mean that a control is
the most effective and appropriate one for a particular risk.

The cost-benefit analysis can be qualitative or quantitative. Its purpose is to demonstrate that the costs of implementing
the controls can be justified by the reduction in the level of risk. For example, the organization may not want to spend
1,000 NGN on a control to reduce a 200 NGN risk.
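The quantitative form of this comparison can be sketched in Python. The sketch assumes that both the cost of the control and the risk before and after the control are expressed in the same currency, and treats a control as justified when its cost does not exceed the risk reduction it buys. The function names and the figures beyond the 1,000 NGN and 200 NGN of the example are hypothetical.

```python
def risk_reduction(risk_before, risk_after):
    """Monetary value of the risk removed by the control."""
    return risk_before - risk_after


def is_cost_justified(control_cost, risk_before, risk_after):
    """A control is considered justified if it costs no more than the risk it removes."""
    return control_cost <= risk_reduction(risk_before, risk_after)


if __name__ == "__main__":
    # Spending 1,000 NGN to remove a 200 NGN risk is not justified.
    print(is_cost_justified(1000, 200, 0))
    # Spending 50 NGN to cut a 200 NGN risk down to 20 NGN is justified.
    print(is_cost_justified(50, 200, 20))
```

A qualitative analysis follows the same logic but replaces the monetary values with ranked categories.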

A cost-benefit analysis for recommended controls encompasses the following: 

a) Determining the impact of implementing the new or enhanced recommended controls 

b) Determining the impact of not implementing the new or enhanced recommended controls 

c) Compatibility and adaptability of the recommended control to the existing ones 

d) Estimating the costs of the implementation. These may include, but are not limited to, the following: 

i. Hardware and software purchases or upgrades 

ii. Reduced operational effectiveness if system performance or functionality is reduced for increased 
security 

iii. Cost of implementing additional policies and procedures 

iv. Cost of hiring additional personnel to implement new policies, procedures, or services, if need be

v. Training costs

vi. Maintenance costs and/or operational costs, if applicable

e) Assessing the implementation costs and benefits against system and data criticality to determine the importance to the
organization of implementing the recommended controls, given their costs and relative impact.

The organization will need to assess the benefits of the controls in terms of maintaining an acceptable mission posture for
the organization. Just as there is a cost for implementing a needed control, there is also a cost for not implementing it. By
relating the result of not implementing the control to the mission, organizations can determine whether it is feasible to
forgo its implementation. At this stage, after considering the above caveats, it is recommended that the recommended
control that best fits the purpose be implemented.

V. CONCLUSIONS 

Identifying risks is important, but mitigating them is much more important, as ensuring the safety of stakeholders'
investments and, of course, mission protection is one of every organization's principal targets. Observing and following
the steps set out in the process is key to fulfilling the risk mitigation process. In addition, cost-benefit
analysis cannot be overlooked, as it is a critical component of the entire process. Hence every recommended control or
option must be thoroughly analyzed to ensure that it is fit for purpose and to give further assurance of the targeted results or
impact. Throughout, stakeholders should be consistently updated and communication sustained across the life span of the
organization. This would further embed the process among stakeholders and build an organization-wide proactive focus
on mitigating high and/or critical medium risks.